2025-05-30-12-06

A Large Language Model-Enabled Control Architecture for Dynamic Resource Capability Exploration in Multi-Agent Manufacturing Systems

Abstract

arXiv:2505.22814v1 Announce Type: new Abstract: Manufacturing environments are becoming more complex and unpredictable due to factors such as demand variations and shorter product lifespans. This complexity requires real-time decision-making and adaptation to disruptions. Traditional control approaches highlight the need for advanced control strategies capable of overcoming unforeseen challenges, as they demonstrate limitations in responsiveness within dynamic industrial settings. Multi-agent systems address these challenges through decentralization of decision-making, enabling systems to respond dynamically to operational changes. However, current multi-agent systems encounter challenges related to real-time adaptation, context-aware decision-making, and the dynamic exploration of resource capabilities. Large language models provide the possibility to overcome these limitations through context-aware decision-making capabilities. This paper introduces a large language model-enabled control architecture for multi-agent manufacturing systems to dynamically explore resource capabilities in response to real-time disruptions. A simulation-based case study demonstrates that the proposed architecture improves system resilience and flexibility. The case study findings show improved throughput and efficient resource utilization compared to existing approaches.

摘要

制造环境因需求波动和产品生命周期缩短等因素正变得日益复杂且难以预测。这种复杂性要求系统具备实时决策能力及对突发干扰的适应性。传统控制方法在动态工业环境中的响应能力存在局限，凸显了对先进控制策略的需求。多智能体系统通过决策分散化应对这些挑战，使系统能够动态响应运行变化。然而现有多智能体系统仍面临实时适应、情境感知决策及资源能力动态探索等挑战。大语言模型凭借其情境感知决策能力，为突破这些限制提供了可能。本文提出一种基于大语言模型的多智能体制造系统控制架构，用于动态探索资源能力以应对实时干扰。基于仿真的案例研究表明，该架构提升了系统韧性与灵活性。研究发现，相较于现有方法，该架构实现了吞吐量提升与资源利用效率优化。

Profiling and optimization of multi-card GPU machine learning jobs

Abstract

arXiv:2505.22905v1 Announce Type: new Abstract: The effectiveness and efficiency of machine learning methodologies are crucial, especially with respect to the quality of results and computational cost. This paper discusses different model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition, adapted to different hardware and software configurations, including distributed data parallelism and distributed hardware processing, are analyzed. Selected optimization strategies are studied in detail, highlighting the related challenges and advantages of their implementation. Furthermore, the impact of different performance improvement techniques (DPO, LoRA, QLoRA, and QAT) on the tuning process of large language models is investigated. Experimental results illustrate how the nature of the task affects the iteration time in a multiprocessor environment, VRAM utilization, and overall memory transfers. Test scenarios are evaluated on the modern NVIDIA H100 GPU architecture.

摘要

机器学习方法的有效性和效率至关重要，尤其在结果质量与计算成本方面。本文探讨了不同的模型优化技术，对关键性能指标进行了全面分析。研究针对图像识别任务提出了多种并行化策略，这些策略适配不同的硬件与软件配置，包括分布式数据并行和分布式硬件处理方案。我们对选定的优化策略进行了详细研究，重点阐述了其实施过程中的相关挑战与优势。此外，本文还探究了不同性能提升技术（DPO、LoRA、QLoRA和QAT）对大型语言模型调优过程的影响。实验结果表明任务特性如何影响多处理器环境中的迭代时间、显存利用率以及整体内存传输量。所有测试场景均在现代NVIDIA H100 GPU架构上进行评估。

Be.FM: Open Foundation Models for Human Behavior

Abstract

arXiv:2505.23058v1 Announce Type: new Abstract: Despite their success in numerous fields, the potential of foundation models for modeling and understanding human behavior remains largely unexplored. We introduce Be.FM, one of the first open foundation models designed for human behavior modeling. Built upon open-source large language models and fine-tuned on a diverse range of behavioral data, Be.FM can be used to understand and predict human decision-making. We construct a comprehensive set of benchmark tasks for testing the capabilities of behavioral foundation models. Our results demonstrate that Be.FM can predict behaviors, infer characteristics of individuals and populations, generate insights about contexts, and apply behavioral science knowledge.

摘要

尽管基础模型在众多领域取得了成功，但其在人类行为建模与理解方面的潜力仍亟待探索。本文提出Be.FM——首个面向人类行为建模的开放基础模型之一。该模型基于开源大语言框架构建，并通过多源行为数据微调，能够用于理解和预测人类决策行为。我们建立了一套完整的基准测试任务体系以评估行为基础模型的性能。实验结果表明，Be.FM具备行为预测、个体与群体特征推断、情境洞察生成以及行为科学知识应用等能力。

Decomposing Elements of Problem Solving: What "Math" Does RL Teach?

Abstract

arXiv:2505.22756v1 Announce Type: new Abstract: Mathematical reasoning tasks have become prominent benchmarks for assessing the reasoning capabilities of LLMs, especially with reinforcement learning (RL) methods such as GRPO showing significant performance gains. However, accuracy metrics alone do not support fine-grained assessment of capabilities and fail to reveal which problem-solving skills have been internalized. To better understand these capabilities, we propose to decompose problem solving into fundamental capabilities: Plan (mapping questions to sequences of steps), Execute (correctly performing solution steps), and Verify (identifying the correctness of a solution). Empirically, we find that GRPO mainly enhances the execution skill, improving execution robustness on problems the model already knows how to solve, a phenomenon we call temperature distillation. More importantly, we show that RL-trained models struggle with fundamentally new problems, hitting a 'coverage wall' due to insufficient planning skills. To explore RL's impact more deeply, we construct a minimal, synthetic solution-tree navigation task as an analogy for mathematical problem-solving. This controlled setup replicates our empirical findings, confirming RL primarily boosts execution robustness. Importantly, in this setting, we identify conditions under which RL can potentially overcome the coverage wall through improved exploration and generalization to new solution paths. Our findings provide insights into the role of RL in enhancing LLM reasoning, expose key limitations, and suggest a path toward overcoming these barriers. Code is available at https://github.com/cfpark00/RL-Wall.

摘要

数学推理任务已成为评估大语言模型（LLM）推理能力的重要基准，尤其是采用强化学习（RL）方法（如GRPO）后，模型性能显著提升。然而，仅凭准确率指标无法支持对能力的细粒度评估，也难以揭示模型内化了哪些问题解决技能。为深入理解这些能力，我们提出将问题解决分解为三项基础能力：规划（将问题映射为解题步骤序列）、执行（正确执行解题步骤）和验证（判断解答的正确性）。实证研究发现，GRPO主要增强了执行能力——提升模型对已知问题解的稳健性，这一现象我们称之为"温度蒸馏"。更重要的是，我们发现经过RL训练的模型在面对全新问题时表现欠佳，由于规划能力不足而遭遇"覆盖墙"。为深入探究RL的影响，我们构建了一个最小化的合成解树导航任务作为数学问题解决的类比。这一受控实验复现了实证结果，证实RL主要提升执行稳健性。值得注意的是，在此设定下，我们发现了RL可能通过改进探索能力和对新解路径的泛化来突破覆盖墙的条件。本研究揭示了RL在增强LLM推理中的作用，指出现有关键局限，并为突破这些障碍提供了路径。代码详见https://github.com/cfpark00/RL-Wall。

Design and testing of an agent chatbot supporting decision making with public transport data

Abstract

arXiv:2505.22698v1 Announce Type: new Abstract: Assessing the quality of public transportation services requires the analysis of large quantities of data on the scheduled and actual trips and documents listing the quality constraints each service needs to meet. Interrogating such datasets with SQL queries, organizing and visualizing the data can be quite complex for most users. This paper presents a chatbot offering a user-friendly tool to interact with these datasets and support decision making. It is based on an agent architecture, which expands the capabilities of the core Large Language Model (LLM) by allowing it to interact with a series of tools that can execute several tasks, like performing SQL queries, plotting data and creating maps from the coordinates of a trip and its stops. This paper also tackles one of the main open problems of such Generative AI projects: collecting data to measure the system's performance. Our chatbot has been extensively tested with a workflow that asks several questions and stores the generated query, the retrieved data and the natural language response for each of them. Such questions are drawn from a set of base examples which are then completed with actual data from the database. This procedure yields a dataset for the evaluation of the chatbot's performance, especially the consistency of its answers and the correctness of the generated queries.

摘要

评估公共交通服务质量需要分析大量关于计划行程与实际行程的数据，以及列明各项服务需满足质量约束的文档。对于大多数用户而言，使用SQL查询来检索此类数据集、组织并可视化数据可能相当复杂。本文提出一种聊天机器人，为用户提供与这些数据集交互的友好工具以支持决策。该工具基于智能体架构，通过让核心大语言模型（LLM）与一系列工具交互来扩展其功能，这些工具可执行多项任务，如运行SQL查询、绘制数据图表，以及根据行程及其站点的坐标生成地图。本文还解决了此类生成式人工智能项目的主要开放性问题之一：收集数据以衡量系统性能。我们通过特定工作流程对聊天机器人进行了全面测试，该流程会提出若干问题，并存储每个问题对应的生成查询、检索数据及自然语言回答。这些问题源自一组基础案例，随后通过数据库中的实际数据进行补充。此流程生成了一个用于评估聊天机器人性能的数据集，特别是其回答的一致性和生成查询的正确性。
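
The test workflow described above (base questions completed with actual database values, with the generated query, retrieved data, and answer stored for each) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `trips` table, the question template, and the `fake_chatbot` stub standing in for the LLM agent are all hypothetical.

```python
import sqlite3

def build_db():
    # Hypothetical toy schema; the real system works on scheduled/actual trip data.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE trips (line TEXT, delay_min REAL)")
    conn.executemany(
        "INSERT INTO trips VALUES (?, ?)",
        [("A", 2.0), ("A", 4.0), ("B", 1.0)],
    )
    return conn

def fake_chatbot(question, line):
    # Stand-in for the LLM agent: returns a SQL query plus its parameters.
    return "SELECT AVG(delay_min) FROM trips WHERE line = ?", (line,)

def run_workflow(conn, template):
    # Complete the base question with actual values taken from the database.
    lines = [r[0] for r in conn.execute("SELECT DISTINCT line FROM trips ORDER BY line")]
    records = []
    for line in lines:
        question = template.format(line=line)
        sql, params = fake_chatbot(question, line)
        data = conn.execute(sql, params).fetchall()
        answer = f"Average delay on line {line}: {data[0][0]:.1f} min"
        # Store query, data, and answer so they can be evaluated later.
        records.append({"question": question, "query": sql, "data": data, "answer": answer})
    return records

records = run_workflow(build_db(), "What is the average delay on line {line}?")
```

Running the same template several times and comparing the stored `query` and `answer` fields is one simple way to check consistency across runs.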

TailorSQL: An NL2SQL System Tailored to Your Query Workload

Abstract

arXiv:2505.23039v1 Announce Type: new Abstract: NL2SQL (natural language to SQL) translates natural language questions into SQL queries, thereby making structured data accessible to non-technical users, serving as the foundation for intelligent data applications. State-of-the-art NL2SQL techniques typically perform translation by retrieving database-specific information, such as the database schema, and invoking a pre-trained large language model (LLM) using the question and retrieved information to generate the SQL query. However, existing NL2SQL techniques miss a key opportunity which is present in real-world settings: NL2SQL is typically applied on existing databases which have already served many SQL queries in the past. The past query workload implicitly contains information which is helpful for accurate NL2SQL translation and is not apparent from the database schema alone, such as common join paths and the semantics of obscurely named tables and columns. We introduce TailorSQL, an NL2SQL system that takes advantage of information in the past query workload to improve both the accuracy and latency of translating natural language questions into SQL. By specializing to a given workload, TailorSQL achieves up to 2$\times$ improvement in execution accuracy on standardized benchmarks.

摘要

NL2SQL（自然语言转SQL）技术将自然语言问题转化为SQL查询语句，使非技术用户能够访问结构化数据，成为智能数据应用的基础。最先进的NL2SQL技术通常通过检索数据库特定信息（如数据库模式）并调用预训练的大语言模型（LLM），结合问题与检索信息生成SQL查询。然而，现有NL2SQL技术忽略了实际场景中的关键机会：NL2SQL通常应用于已有历史SQL查询记录的现有数据库。这些过往查询负载隐含了有助于提升NL2SQL翻译准确性的信息（如常见连接路径及命名晦涩的表/列语义），这些信息无法仅通过数据库模式获取。我们提出TailorSQL系统，利用历史查询负载中的信息来同时提高自然语言转SQL的准确性与响应速度。通过针对特定工作负载进行优化，TailorSQL在标准测试基准上实现了最高2倍的执行准确率提升。
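
To make the idea of "common join paths" in a past query workload concrete, here is a small sketch that counts equality join conditions across historical queries. It is not TailorSQL's method: the regex-based extraction is a deliberately simple stand-in, it assumes unaliased `table.column = table.column` conditions, and the example workload is invented.

```python
import re
from collections import Counter

# Matches conditions of the form table1.col1 = table2.col2
JOIN_RE = re.compile(r"(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)")

def common_join_paths(past_queries):
    """Count how often each pair of columns is joined in the workload."""
    counts = Counter()
    for sql in past_queries:
        for t1, c1, t2, c2 in JOIN_RE.findall(sql):
            # Sort the two sides so a = b and b = a count as the same path.
            left, right = sorted([f"{t1}.{c1}", f"{t2}.{c2}"])
            counts[(left, right)] += 1
    return counts

# Hypothetical past workload
workload = [
    "SELECT * FROM orders JOIN cust ON orders.cid = cust.id",
    "SELECT cust.name FROM cust JOIN orders ON cust.id = orders.cid",
    "SELECT * FROM orders JOIN items ON orders.id = items.oid",
]
paths = common_join_paths(workload)
top = paths.most_common(1)[0]
```

The most frequent paths could then be added to the prompt next to the schema, which is the kind of workload-derived hint the abstract refers to.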

Revisiting Multi-Agent Debate as Test-Time Scaling: A Systematic Study of Conditional Effectiveness

Abstract

arXiv:2505.22960v1 Announce Type: new Abstract: The remarkable growth in large language model (LLM) capabilities has spurred exploration into multi-agent systems, with debate frameworks emerging as a promising avenue for enhanced problem-solving. These multi-agent debate (MAD) approaches, where agents collaboratively present, critique, and refine arguments, potentially offer improved reasoning, robustness, and diverse perspectives over monolithic models. Despite prior studies leveraging MAD, a systematic understanding of its effectiveness compared to self-agent methods, particularly under varying conditions, remains elusive. This paper seeks to fill this gap by conceptualizing MAD as a test-time computational scaling technique, distinguished by collaborative refinement and diverse exploration capabilities. We conduct a comprehensive empirical investigation comparing MAD with strong self-agent test-time scaling baselines on mathematical reasoning and safety-related tasks. Our study systematically examines the influence of task difficulty, model scale, and agent diversity on MAD's performance. Key findings reveal that, for mathematical reasoning, MAD offers limited advantages over self-agent scaling but becomes more effective with increased problem difficulty and decreased model capability, while agent diversity shows little benefit. Conversely, for safety tasks, MAD's collaborative refinement can increase vulnerability, but incorporating diverse agent configurations facilitates a gradual reduction in attack success through the collaborative refinement process. We believe our findings provide critical guidance for the future development of more effective and strategically deployed MAD systems.

摘要

大语言模型（LLM）能力的显著增长推动了对多智能体系统的探索，其中辩论框架作为一种增强问题解决能力的有前景途径而崭露头角。这些多智能体辩论（MAD）方法通过智能体协作提出、批判和完善论点，相较于单一模型可能提供更优的推理能力、鲁棒性及多元视角。尽管已有研究利用MAD方法，但其与单智能体方法相比的有效性，尤其是在不同条件下的系统性理解仍不明确。本文通过将MAD概念化为一种测试时计算扩展技术（其特点在于协作优化和多样化探索能力），试图填补这一空白。我们在数学推理和安全性相关任务上，对MAD与强基线单智能体测试时扩展方法进行了全面实证比较。研究系统考察了任务难度、模型规模和智能体多样性对MAD性能的影响。关键发现表明：在数学推理任务中，MAD相较单智能体扩展优势有限，但随着问题难度增加或模型能力下降，其有效性会提升，而智能体多样性则未显现明显益处；相反在安全任务中，MAD的协作优化可能增加系统脆弱性，但通过引入多样化智能体配置，协作优化过程可逐步降低攻击成功率。我们相信这些发现为未来开发更有效且具备战略部署价值的MAD系统提供了关键指导。

Speeding up Model Loading with fastsafetensors

Abstract

arXiv:2505.23072v1 Announce Type: new Abstract: The rapid increase in model parameter sizes introduces new challenges in pre-trained model loading. Currently, machine learning code often deserializes each parameter as a tensor object in host memory before copying it to device memory. We found that this approach underutilized storage throughput and significantly slowed down loading large models with a widely used model file format, safetensors. In this work, we present fastsafetensors, a Python library designed to optimize the deserialization of tensors in safetensors files. Our approach first copies groups of on-disk parameters to device memory, where they are directly instantiated as tensor objects. This design enables further optimization in low-level I/O and high-level tensor preprocessing, including parallelized copying, peer-to-peer DMA, and GPU offloading. Experimental results show performance improvements of 4.8x to 7.5x in loading models such as Llama (7, 13, and 70 billion parameters), Falcon (40 billion parameters), and Bloom (176 billion parameters).

摘要

模型参数规模的快速增长给预训练模型加载带来了新的挑战。当前机器学习代码通常先将每个参数作为张量对象在主机内存中反序列化，再将其复制到设备内存。我们发现这种方法未能充分利用存储吞吐量，并显著降低了使用广泛采用的模型文件格式safetensors加载大型模型的速度。本研究提出fastsafetensors——一个专为优化safetensors文件中张量反序列化而设计的Python库。我们的方法首先将磁盘上的参数组直接复制到设备内存，并在其中实例化为张量对象。该设计支持在底层I/O和高层张量预处理环节实施进一步优化，包括并行复制、点对点直接内存访问以及GPU卸载等。实验结果表明，在加载Llama（70亿、130亿和700亿参数）、Falcon（400亿参数）和Bloom（1760亿参数）等模型时，性能提升了4.8至7.5倍。
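
The contrast between per-parameter deserialization and copying groups of parameters can be illustrated on the safetensors layout itself: an 8-byte little-endian header length, a JSON header with `dtype`, `shape`, and `data_offsets` per tensor, then the raw byte buffer. The sketch below does not use the fastsafetensors API and involves no GPU; `make_blob` and `grouped_read` are hypothetical helpers that only show how several tensors can be fetched with one contiguous read and then sliced.

```python
import json
import struct

def make_blob(tensors):
    """Build a tiny safetensors-style blob. tensors: name -> (dtype, shape, raw bytes)."""
    header, payload, offset = {}, b"", 0
    for name, (dtype, shape, raw) in tensors.items():
        header[name] = {"dtype": dtype, "shape": shape,
                        "data_offsets": [offset, offset + len(raw)]}
        payload += raw
        offset += len(raw)
    head = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(head)) + head + payload

def parse_header(blob):
    # First 8 bytes: header size as an unsigned little-endian 64-bit integer.
    (n,) = struct.unpack("<Q", blob[:8])
    header = json.loads(blob[8:8 + n])
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header, 8 + n

def grouped_read(blob, names):
    """Fetch one contiguous byte range covering all requested tensors, then slice it."""
    header, base = parse_header(blob)
    begin = min(header[k]["data_offsets"][0] for k in names)
    end = max(header[k]["data_offsets"][1] for k in names)
    chunk = blob[base + begin: base + end]  # a single read instead of one per tensor
    return {k: chunk[header[k]["data_offsets"][0] - begin:
                     header[k]["data_offsets"][1] - begin] for k in names}

blob = make_blob({
    "w": ("F32", [2], struct.pack("<2f", 1.0, 2.0)),
    "b": ("F32", [1], struct.pack("<f", 0.5)),
})
views = grouped_read(blob, ["w", "b"])
```

In the approach the abstract describes, the grouped bytes would land in device memory and be instantiated as tensors there; here they stay as plain `bytes` slices.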

Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism

Abstract

arXiv:2505.23219v1 Announce Type: new Abstract: In-situ LLM inference on end-user devices has gained significant interest due to its privacy benefits and reduced dependency on external infrastructure. However, as the decoding process is memory-bandwidth-bound, the diverse processing units in modern end-user devices cannot be fully exploited, resulting in slow LLM inference. This paper presents Ghidorah, an LLM inference system for end-user devices with a unified memory architecture. The key idea of Ghidorah can be summarized in two steps: 1) leveraging speculative decoding approaches to enhance parallelism, and 2) ingeniously distributing workloads across multiple heterogeneous processing units to maximize computing power utilization. Ghidorah includes the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach. The HCMP architecture guides partitioning by leveraging the unified memory design of end-user devices and adapting to the hybrid computational demands of speculative decoding. The ARCA approach is used to determine the optimal speculative strategy and partitioning strategy, balancing acceptance rate with parallel capability to maximize the speedup. Additionally, we optimize sparse computation on ARM CPUs. Experimental results show that Ghidorah can achieve up to 7.6x speedup in the dominant LLM decoding phase compared to the sequential decoding approach on NVIDIA Jetson NX.

摘要

由于隐私优势及降低对外部基础设施的依赖，在终端用户设备上进行大语言模型（LLM）原位推理受到广泛关注。然而，由于解码过程受内存带宽限制，现代终端设备中的多样化处理单元无法被充分利用，导致LLM推理速度缓慢。本文提出Ghidorah——一种面向统一内存架构终端设备的LLM推理系统，其核心思想可归纳为两步：1）利用推测式解码方法增强并行性；2）通过智能分配异构处理单元的工作负载实现计算资源最大化利用。Ghidorah包含异构核心模型并行（HCMP）架构与架构感知性能分析（ARCA）方法：HCMP架构依托终端设备的统一内存设计进行任务划分，并适配推测式解码的混合计算需求；ARCA方法用于确定最优推测策略与划分策略，通过平衡接受率与并行能力实现加速比最大化。此外，我们还优化了ARM CPU的稀疏计算。实验表明，在NVIDIA Jetson NX平台上，Ghidorah相较于顺序解码方法在LLM主导的解码阶段可实现最高7.6倍的加速比。
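
Speculative decoding, the first of the two steps above, can be sketched with a generic greedy draft-then-verify loop. This is a textbook-style illustration with toy integer "models", not Ghidorah's implementation; `speculative_step`, `target`, and `draft` are hypothetical names, and the verification loop that runs sequentially here would be a single batched pass on real hardware.

```python
def speculative_step(target_next, draft_next, prefix, k):
    """One greedy draft-then-verify step.

    target_next / draft_next map a list of tokens to the next token.
    Returns the tokens that the target model would also have produced.
    """
    # 1) The cheap draft model proposes k tokens one by one.
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        drafted.append(tok)
        ctx.append(tok)

    # 2) The target model checks every drafted position.
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # first mismatch: keep the target's token and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: one extra token for free
    return accepted

def target(ctx):
    # Toy target model: always counts up by one.
    return ctx[-1] + 1

def draft(ctx):
    # Toy draft model: agrees with the target except right after a multiple of 4.
    return ctx[-1] + (2 if ctx[-1] % 4 == 0 else 1)

out = speculative_step(target, draft, [1], k=4)
```

The trade-off the abstract assigns to ARCA shows up even here: a larger `k` exposes more parallel verification work, but only pays off while the draft's acceptance rate stays high.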

Infi-MMR: Curriculum-based Unlocking Multimodal Reasoning via Phased Reinforcement Learning in Multimodal Small Language Models

Abstract

arXiv:2505.23091v1 Announce Type: new Abstract: Recent advancements in large language models (LLMs) have demonstrated substantial progress in reasoning capabilities, such as DeepSeek-R1, which leverages rule-based reinforcement learning to enhance logical reasoning significantly. However, extending these achievements to multimodal large language models (MLLMs) presents critical challenges, which are frequently more pronounced for Multimodal Small Language Models (MSLMs) given their typically weaker foundational reasoning abilities: (1) the scarcity of high-quality multimodal reasoning datasets, (2) the degradation of reasoning capabilities due to the integration of visual processing, and (3) the risk that direct application of reinforcement learning may produce complex yet incorrect reasoning processes. To address these challenges, we design a novel framework Infi-MMR to systematically unlock the reasoning potential of MSLMs through a curriculum of three carefully structured phases and propose our multimodal reasoning model Infi-MMR-3B. The first phase, Foundational Reasoning Activation, leverages high-quality textual reasoning datasets to activate and strengthen the model's logical reasoning capabilities. The second phase, Cross-Modal Reasoning Adaptation, utilizes caption-augmented multimodal data to facilitate the progressive transfer of reasoning skills to multimodal contexts. The third phase, Multimodal Reasoning Enhancement, employs curated, caption-free multimodal data to mitigate linguistic biases and promote robust cross-modal reasoning. Infi-MMR-3B achieves both state-of-the-art multimodal math reasoning ability (43.68% on MathVerse testmini, 27.04% on MathVision test, and 21.33% on OlympiadBench) and general reasoning ability (67.2% on MathVista testmini).

摘要

大型语言模型（LLMs）的最新进展在推理能力方面取得了显著进步，例如DeepSeek-R1通过基于规则的强化学习显著提升了逻辑推理能力。然而，将这些成果扩展到多模态大型语言模型（MLLMs）面临关键挑战，对于基础推理能力通常较弱的多模态小型语言模型（MSLMs）而言，这些问题尤为突出：（1）高质量多模态推理数据集的稀缺性；（2）视觉处理整合导致的推理能力退化；（3）直接应用强化学习可能产生复杂但错误的推理过程的风险。为解决这些挑战，我们设计了新颖框架Infi-MMR，通过三个精心设计的阶段性课程系统化释放MSLMs的推理潜力，并提出了多模态推理模型Infi-MMR-3B。第一阶段"基础推理激活"利用高质量文本推理数据集激活并增强模型的逻辑推理能力。第二阶段"跨模态推理适应"通过标题增强的多模态数据促进推理技能向多模态场景的渐进迁移。第三阶段"多模态推理强化"采用精选的无标题多模态数据以减少语言偏见并促进稳健的跨模态推理。Infi-MMR-3B在多模态数学推理能力（MathVerse testmini 43.68%、MathVision test 27.04%、OlympiadBench 21.33%）和通用推理能力（MathVista testmini 67.2%）上均达到最先进水平。

MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning

Abstract

arXiv:2505.23254v1 Announce Type: new Abstract: Owing to the huge success of generative artificial intelligence (AI), large language models (LLMs) have emerged as a core subclass, underpinning applications such as question answering, text generation, and code completion. While fine-tuning these models on domain-specific data can yield significant performance gains, it also poses daunting computational challenges, especially for researchers and small organizations with limited hardware resources. Although SSD offloading (i.e., ZeRO-Infinity) has emerged as a viable strategy to overcome the GPU memory barrier via leveraging both system memory (i.e., CPU DRAM) and storage space (i.e., solid-state devices, SSDs), its design primarily targets model-centric performance issues. As a result, key system-level issues, including system memory fragmentation, inefficient pinned buffer allocation, peak CPU usage spikes, and file system overhead, remain unaddressed, stifling scalability and inflating costs. Such an observation motivates this paper to introduce MemAscend, a framework that systematically tackles the underexplored system memory bottlenecks in SSD-offloaded LLM training, with a focus on resource-constrained environments. By streamlining pinned-memory allocation, eradicating fragmentation, and mitigating peak overhead, MemAscend reclaims a substantial system memory budget, enabling larger models, longer context windows, and higher batch sizes without exceeding modest hardware limits. Across diverse LLM benchmarks, MemAscend reduces peak system-memory consumption by an average of 55.7% compared with standard SSD offloading techniques, lowering the hardware barrier for fine-tuning and unlocking new possibilities for cost-effective large-scale training on limited-resource machines.

摘要

由于生成式人工智能（AI）的巨大成功，大语言模型（LLM）已成为核心子类，支撑着问答、文本生成和代码补全等应用。虽然在领域特定数据上微调这些模型能显著提升性能，但也带来了巨大的计算挑战，尤其对硬件资源有限的研究者和小型机构而言。尽管SSD卸载技术（如ZeRO-Infinity）通过利用系统内存（CPU DRAM）和存储空间（固态硬盘）成为突破GPU内存壁垒的可行策略，但其设计主要针对模型中心的性能问题。这导致关键系统级问题——包括系统内存碎片、低效的固定缓冲区分配、CPU使用率峰值激增及文件系统开销——仍未解决，从而抑制了可扩展性并推高了成本。这一观察促使本文提出MemAscend框架，该系统化解决SSD卸载LLM训练中尚未充分探索的系统内存瓶颈，尤其关注资源受限环境。通过优化固定内存分配、消除碎片并降低峰值开销，MemAscend回收了大量系统内存预算，使得在有限硬件条件下能训练更大模型、支持更长上下文窗口和更高批量大小。在多样化LLM基准测试中，相比标准SSD卸载技术，MemAscend平均降低55.7%的系统内存峰值消耗，降低了微调的硬件门槛，为资源受限机器上经济高效的大规模训练开辟了新可能。

Second Opinion Matters: Towards Adaptive Clinical AI via the Consensus of Expert Model Ensemble

Abstract

arXiv:2505.23075v1 Announce Type: new Abstract: Despite the growing clinical adoption of large language models (LLMs), current approaches heavily rely on single model architectures. To overcome risks of obsolescence and rigid dependence on single model systems, we present a novel framework, termed the Consensus Mechanism. Mimicking clinical triage and multidisciplinary clinical decision-making, the Consensus Mechanism implements an ensemble of specialized medical expert agents enabling improved clinical decision making while maintaining robust adaptability. This architecture enables the Consensus Mechanism to be optimized for cost, latency, or performance, purely based on its interior model configuration. To rigorously evaluate the Consensus Mechanism, we employed three medical evaluation benchmarks: MedMCQA, MedQA, and MedXpertQA Text, and the differential diagnosis dataset, DDX+. On MedXpertQA, the Consensus Mechanism achieved an accuracy of 61.0% compared to 53.5% and 45.9% for OpenAI's O3 and Google's Gemini 2.5 Pro. Improvement was consistent across benchmarks with an increase in accuracy on MedQA ( $\Delta\mathrm{Accuracy}_{\mathrm{consensus\text{-}O3}} = 3.4\%$ ) and MedMCQA ( $\Delta\mathrm{Accuracy}_{\mathrm{consensus\text{-}O3}} = 9.1\%$ ). These accuracy gains extended to differential diagnosis generation, where our system demonstrated improved recall and precision (F1 $_\mathrm{consensus}$ = 0.326 vs. F1 $_{\mathrm{O3\text{-}high}}$ = 0.2886) and a higher top-1 accuracy for DDX (Top1 $_\mathrm{consensus}$ = 52.0% vs. Top1 $_{\mathrm{O3\text{-}high}}$ = 45.2%).

摘要

尽管大型语言模型（LLMs）在临床中的应用日益广泛，但当前方法严重依赖单一模型架构。为克服模型过时和单一系统刚性依赖的风险，我们提出了一种称为"共识机制"的新框架。该机制通过模拟临床分诊和多学科临床决策流程，集成多个专业医学专家代理，在提升临床决策质量的同时保持强大的适应性。此架构使共识机制能够根据内部模型配置，灵活优化成本、延迟或性能表现。

为严格评估共识机制，我们采用三个医学评估基准（MedMCQA、MedQA和MedXpertQA Text）及鉴别诊断数据集DDX+。在MedXpertQA上，共识机制准确率达61.0%，优于OpenAI O3（53.5%）和Google Gemini 2.5 Pro（45.9%）。各基准测试均显示性能提升：MedQA准确率提升3.4%（ $\Delta\mathrm{Accuracy}_{\mathrm{consensus\text{-}O3}}$ ），MedMCQA提升9.1%。该优势延伸至鉴别诊断生成领域，本系统在召回率与精确度（F1 $_\mathrm{consensus}$ =0.326 vs. F1 $_{\mathrm{O3\text{-}high}}$ =0.2886）及DDX top-1准确率（Top1 $_\mathrm{consensus}$ =52.0% vs. Top1 $_{\mathrm{O3\text{-}high}}$ =45.2%）上均表现出显著改进。
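
The abstract does not say how the expert agents' outputs are combined, so the following is only a generic ensemble sketch using plain majority voting, which may differ from the actual Consensus Mechanism. The expert names and answers are made up for illustration.

```python
from collections import Counter

def consensus_answer(expert_answers):
    """Return the most common answer and the fraction of experts agreeing with it.

    expert_answers: dict mapping an expert name to its chosen option.
    Ties are broken in favor of the earliest-listed expert (an arbitrary choice).
    """
    counts = Counter(expert_answers.values())
    best = max(counts.values())
    for answer in expert_answers.values():
        if counts[answer] == best:
            return answer, best / len(expert_answers)

# Hypothetical multiple-choice question answered by three specialist agents
answer, agreement = consensus_answer(
    {"cardiology": "B", "neurology": "B", "radiology": "C"}
)
```

Because the aggregation only sees answers, individual experts can be swapped for cheaper, faster, or stronger models without changing this step, which is the adaptability argument made in the abstract.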

Understanding the Information Propagation Effects of Communication Topologies in LLM-based Multi-Agent Systems

Abstract

arXiv:2505.23352v1 Announce Type: new Abstract: The communication topology in large language model-based multi-agent systems fundamentally governs inter-agent collaboration patterns, critically shaping both the efficiency and effectiveness of collective decision-making. While recent studies on the automated design of communication topologies tend to construct sparse structures for efficiency, they often overlook why and when sparse and dense topologies help or hinder collaboration. In this paper, we present a causal framework to analyze how agent outputs, whether correct or erroneous, propagate under topologies with varying sparsity. Our empirical studies reveal that moderately sparse topologies, which effectively suppress error propagation while preserving beneficial information diffusion, typically achieve optimal task performance. Guided by this insight, we propose a novel topology design approach, EIB-Learner, that balances error suppression and beneficial information propagation by fusing connectivity patterns from both dense and sparse graphs. Extensive experiments show the superior effectiveness, communication cost, and robustness of EIB-Learner.

摘要

基于大语言模型的多智能体系统中，通信拓扑结构从根本上决定了智能体间的协作模式，对集体决策的效率和效果具有关键性影响。尽管近期关于通信拓扑自动设计的研究倾向于构建稀疏结构以提升效率，但这些研究往往忽视了稀疏与密集拓扑在何种情况下会促进或阻碍协作的内在机制。本文提出一个因果分析框架，用以研究智能体输出（无论正确或错误）在不同稀疏度拓扑下的传播规律。实证研究表明，适度稀疏的拓扑能有效抑制错误传播同时保留有益信息扩散，通常可实现最优任务性能。基于此发现，我们提出一种新型拓扑设计方法EIB-Learner，通过融合密集图与稀疏图的连接模式，实现错误抑制与有益信息传播的平衡。大量实验证明EIB-Learner在效能、通信成本和鲁棒性方面均具有显著优势。

MathArena: Evaluating LLMs on Uncontaminated Math Competitions

Abstract

arXiv:2505.23281v1 Announce Type: new Abstract: The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as SMT 2025 -- published well after model release dates -- demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On USAMO 2025, even top models score below 25%, far behind their performance on final-answer tasks. So far, we have evaluated 30 models across five competitions, totaling 149 problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.

摘要

大型语言模型（LLMs）推理能力的快速提升显著改善了数学基准测试表现。然而，当前最常用的评估数据集（如AIME 2024）大多可在线公开获取，导致难以区分模型是真正具备推理能力还是存在记忆现象。此外，这些基准未能评估对数学任务至关重要的证明撰写能力。为此，我们提出MathArena基准，其核心思路在于：周期性数学竞赛能持续提供高质量、高难度题目，可用于LLMs的实时评估。通过在新题发布后立即测试模型，我们有效消除了数据污染风险。基于该框架，我们在AIME 2024中发现了明显的数据污染迹象。而在模型发布后新推出的高难度竞赛（如SMT 2025）上，顶尖模型仍展现出卓越的推理能力。MathArena也是首个评估证明撰写能力的基准——在USAMO 2025测试中，即使最优模型得分也不足25%，远低于其最终答案类任务的表现。目前我们已评估30个模型在5项竞赛中的149道题目。作为动态基准，MathArena将持续追踪LLMs在新发布竞赛中的表现，确保对数学推理能力进行严格且与时俱进的评估。

Autoformalization in the Era of Large Language Models: A Survey

Abstract

arXiv:2505.23486v1 Announce Type: new Abstract: Autoformalization, the process of transforming informal mathematical propositions into verifiable formal representations, is a foundational task in automated theorem proving, offering a new perspective on the use of mathematics in both theoretical and applied domains. Driven by the rapid progress in artificial intelligence, particularly large language models (LLMs), this field has witnessed substantial growth, bringing both new opportunities and unique challenges. In this survey, we provide a comprehensive overview of recent advances in autoformalization from both mathematical and LLM-centric perspectives. We examine how autoformalization is applied across various mathematical domains and levels of difficulty, and analyze the end-to-end workflow from data preprocessing to model design and evaluation. We further explore the emerging role of autoformalization in enhancing the verifiability of LLM-generated outputs, highlighting its potential to improve both the trustworthiness and reasoning capabilities of LLMs. Finally, we summarize key open-source models and datasets supporting current research, and discuss open challenges and promising future directions for the field.

摘要

自动形式化是将非形式化的数学命题转化为可验证的形式化表示的过程，作为自动化定理证明的基础性任务，其为理论及应用领域中数学运用的研究提供了新视角。在人工智能尤其是大语言模型（LLMs）快速发展的推动下，该领域取得了显著进展，既带来新机遇也面临独特挑战。本综述从数学与LLM核心视角出发，全面梳理了自动形式化的最新进展。我们考察了该方法在不同数学领域及难度层级中的应用情况，系统分析了从数据预处理到模型设计与评估的端到端工作流程。进一步探讨了自动形式化在增强LLM生成内容可验证性方面的新兴作用，揭示了其在提升LLMs可信度与推理能力方面的潜力。最后，我们总结了支撑当前研究的关键开源模型与数据集，并就该领域的开放挑战与未来发展方向进行了讨论。

KVzip: Query-Agnostic KV Cache Compression with Context Reconstruction

Abstract

arXiv:2505.23416v1 Announce Type: new Abstract: Transformer-based large language models (LLMs) cache context as key-value (KV) pairs during inference. As context length grows, KV cache sizes expand, leading to substantial memory overhead and increased attention latency. This paper introduces KVzip, a query-agnostic KV cache eviction method enabling effective reuse of compressed KV caches across diverse queries. KVzip quantifies the importance of a KV pair using the underlying LLM to reconstruct original contexts from cached KV pairs, subsequently evicting pairs with lower importance. Extensive empirical evaluations demonstrate that KVzip reduces KV cache size by 3-4 $\times$ and FlashAttention decoding latency by approximately 2 $\times$ , with negligible performance loss in question-answering, retrieval, reasoning, and code comprehension tasks. Evaluations include various models such as LLaMA3.1-8B, Qwen2.5-14B, and Gemma3-12B, with context lengths reaching up to 170K tokens. KVzip significantly outperforms existing query-aware KV eviction methods, which suffer from performance degradation even at a 90% cache budget ratio under multi-query scenarios.

摘要

基于Transformer架构的大语言模型（LLM）在推理过程中会将上下文缓存为键值（KV）对。随着上下文长度增加，KV缓存规模扩大，导致显著的内存开销和注意力延迟上升。本文提出KVzip方法，这是一种与查询无关的KV缓存淘汰机制，能够实现对压缩后KV缓存在不同查询间的有效复用。KVzip通过底层LLM量化KV对的重要性，利用缓存的KV对重构原始上下文，进而淘汰重要性较低的键值对。大量实验评估表明，KVzip可将KV缓存体积缩小3-4倍，使FlashAttention解码延迟降低约2倍，且在问答、检索、推理和代码理解等任务中性能损失可忽略不计。评估涵盖LLaMA3.1-8B、Qwen2.5-14B和Gemma3-12B等多种模型，上下文长度最高达17万token。KVzip显著优于现有查询感知型KV淘汰方法——后者在多查询场景下即使保留90%缓存预算仍会出现性能下降。
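
The eviction step can be sketched independently of how importance is computed. In KVzip the scores come from using the LLM to reconstruct the original context; in the sketch below they are simply given as made-up numbers, and `evict_kv` is a hypothetical helper rather than the paper's code.

```python
def evict_kv(keys, values, scores, keep_ratio):
    """Keep the highest-scoring KV pairs, preserving their original order.

    scores: one importance value per cached pair (assumed to be precomputed).
    keep_ratio: fraction of the cache to retain, e.g. 0.3 for roughly 3x compression.
    """
    n_keep = max(1, int(len(scores) * keep_ratio))
    # Indices of the n_keep most important pairs.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:n_keep]
    kept = sorted(top)  # restore positional order for attention
    return [keys[i] for i in kept], [values[i] for i in kept], kept

# Toy cache with four positions and invented importance scores
keys = ["k0", "k1", "k2", "k3"]
values = ["v0", "v1", "v2", "v3"]
scores = [0.9, 0.1, 0.5, 0.7]
new_keys, new_values, kept = evict_kv(keys, values, scores, keep_ratio=0.5)
```

Since the scores here do not depend on any particular query, the compressed cache can be reused across different queries, which is the query-agnostic property the abstract emphasizes.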

Case-Based Reasoning Enhances the Predictive Power of LLMs in Drug-Drug Interaction

Abstract

arXiv:2505.23034v1 Announce Type: new Abstract: Drug-drug interaction (DDI) prediction is critical for treatment safety. While large language models (LLMs) show promise in pharmaceutical tasks, their effectiveness in DDI prediction remains challenging. Inspired by the well-established clinical practice where physicians routinely reference similar historical cases to guide their decisions through case-based reasoning (CBR), we propose CBR-DDI, a novel framework that distills pharmacological principles from historical cases to improve LLM reasoning for DDI tasks. CBR-DDI constructs a knowledge repository by leveraging LLMs to extract pharmacological insights and graph neural networks (GNNs) to model drug associations. A hybrid retrieval mechanism and dual-layer knowledge-enhanced prompting allow LLMs to effectively retrieve and reuse relevant cases. We further introduce a representative sampling strategy for dynamic case refinement. Extensive experiments demonstrate that CBR-DDI achieves state-of-the-art performance, with a significant 28.7% accuracy improvement over both popular LLMs and CBR baseline, while maintaining high interpretability and flexibility.

摘要

药物相互作用（DDI）预测对治疗安全性至关重要。尽管大语言模型（LLM）在药学任务中展现出潜力，但其在DDI预测中的有效性仍面临挑战。受临床实践中医生常规参考类似历史病例通过案例推理（CBR）指导决策的成熟方法启发，我们提出CBR-DDI框架——该框架通过从历史案例中提炼药理学原理来增强LLM在DDI任务中的推理能力。CBR-DDI构建知识库时采用LLM提取药理学洞见，并利用图神经网络（GNN）建模药物关联关系。通过混合检索机制和双层知识增强提示策略，LLM能够有效检索并复用相关案例。我们还引入动态案例优化的代表性采样策略。大量实验表明，CBR-DDI实现了最先进的性能，相较于主流LLM和CBR基线模型准确率显著提升28.7%，同时保持高解释性与灵活性。

Socratic-PRMBench: Benchmarking Process Reward Models with Systematic Reasoning Patterns

Abstract

arXiv:2505.23474v1 Announce Type: new Abstract: Process Reward Models (PRMs) are crucial in complex reasoning and problem-solving tasks (e.g., LLM agents with long-horizon decision-making) by verifying the correctness of each intermediate reasoning step. In real-world scenarios, LLMs may apply various reasoning patterns (e.g., decomposition) to solve a problem, potentially suffering from errors under various reasoning patterns. Therefore, PRMs are required to identify errors under various reasoning patterns during the reasoning process. However, existing benchmarks mainly focus on evaluating PRMs with stepwise correctness, ignoring a systematic evaluation of PRMs under various reasoning patterns. To mitigate this gap, we introduce Socratic-PRMBench, a new benchmark to evaluate PRMs systematically under six reasoning patterns, including Transformation, Decomposition, Regather, Deduction, Verification, and Integration. Socratic-PRMBench comprises 2995 reasoning paths with flaws within the aforementioned six reasoning patterns. Through our experiments on both PRMs and LLMs prompted as critic models, we identify notable deficiencies in existing PRMs. These observations underscore the significant weakness of current PRMs in conducting evaluations on reasoning steps under various reasoning patterns. We hope Socratic-PRMBench can serve as a comprehensive testbed for systematic evaluation of PRMs under diverse reasoning patterns and pave the way for future development of PRMs.

摘要

过程奖励模型（PRMs）在复杂推理和问题解决任务（如具有长期决策能力的大语言模型智能体）中具有关键作用，其通过验证每个中间推理步骤的正确性来实现功能。在现实场景中，大语言模型可能采用多种推理模式（如分解）来解决问题，但不同推理模式下均可能出现错误。因此，过程奖励模型需要能够在推理过程中识别各种推理模式下的错误。然而，现有基准主要关注基于逐步正确性的PRMs评估，缺乏对不同推理模式下PRMs的系统性评估。为弥补这一不足，我们提出Socratic-PRMBench——一个在六种推理模式（包括转换、分解、重组、演绎、验证和整合）下系统评估PRMs的新基准。该基准包含2995条在上述六种推理模式中存在缺陷的推理路径。通过对PRMs及作为批评模型的大语言模型进行实验，我们发现现有PRMs存在显著不足。这些发现揭示了当前PRMs在多种推理模式下评估推理步骤时的重大缺陷。我们希望Socratic-PRMBench能成为系统评估不同推理模式下PRMs的综合测试平台，并为PRMs的未来发展奠定基础。

Sustainable Carbon-Aware and Water-Efficient LLM Scheduling in Geo-Distributed Cloud Datacenters

Abstract

arXiv:2505.23554v1 Announce Type: new Abstract: In recent years, Large Language Models (LLM) such as ChatGPT, CoPilot, and Gemini have been widely adopted in different areas. As the use of LLMs continues to grow, many efforts have focused on reducing the massive training overheads of these models. But it is the environmental impact of handling user requests to LLMs that is increasingly becoming a concern. Recent studies estimate that the costs of operating LLMs in their inference phase can exceed training costs by 25x per year. As LLMs are queried incessantly, the cumulative carbon footprint for the operational phase has been shown to far exceed the footprint during the training phase. Further, estimates indicate that 500 ml of fresh water is expended for every 20-50 requests to LLMs during inference. To address these important sustainability issues with LLMs, we propose a novel framework called SLIT to co-optimize LLM quality of service (time-to-first token), carbon emissions, water usage, and energy costs. The framework utilizes a machine learning (ML) based metaheuristic to enhance the sustainability of LLM hosting across geo-distributed cloud datacenters. Such a framework will become increasingly vital as LLMs proliferate.

摘要

近年来，诸如ChatGPT、CoPilot和Gemini等大型语言模型（LLM）已在多个领域得到广泛应用。随着LLM使用量的持续增长，许多研究致力于降低这些模型的庞大训练开销。然而，处理用户对LLM请求所产生的环境影响正日益引发关注。最新研究表明，LLM在推理阶段的年运营成本可能超过训练成本的25倍。由于LLM持续接收查询，其运营阶段的累计碳足迹已远超训练阶段。此外，估算显示LLM推理阶段每处理20-50次请求就会消耗500毫升淡水。为解决LLM这些重要的可持续性问题，我们提出名为SLIT的新型框架，协同优化LLM服务质量（首令牌响应时间）、碳排放、水资源消耗及能源成本。该框架采用基于机器学习（ML）的元启发式算法，以提升跨地域分布式云数据中心托管LLM的可持续性。随着LLM的普及，此类框架将变得愈发关键。
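
The co-optimization of time-to-first-token, carbon, water, and energy cost can be pictured as a multi-objective site selection. The sketch below is only an illustration under stated assumptions: it replaces the ML-based metaheuristic mentioned in the abstract with a plain weighted sum over min-max normalized metrics, and the `Site` fields, units, and weights are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Site:
    """Hypothetical per-datacenter estimates for serving one request."""
    name: str
    ttft_ms: float     # time-to-first-token
    carbon_g: float    # carbon emissions
    water_ml: float    # water usage
    energy_usd: float  # energy cost


METRICS = ("ttft_ms", "carbon_g", "water_ml", "energy_usd")


def pick_site(sites: Sequence[Site],
              weights: Sequence[float] = (0.25, 0.25, 0.25, 0.25)) -> Site:
    """Return the site with the lowest weighted sum of normalized metrics."""

    def normalized(metric: str, site: Site) -> float:
        # Min-max scaling so that metrics with different units are comparable.
        values = [getattr(s, metric) for s in sites]
        lo, hi = min(values), max(values)
        return 0.0 if hi == lo else (getattr(site, metric) - lo) / (hi - lo)

    def cost(site: Site) -> float:
        return sum(w * normalized(m, site) for w, m in zip(weights, METRICS))

    return min(sites, key=cost)
```

Shifting weight toward `ttft_ms` favors the fastest site, while shifting it toward `carbon_g` and `water_ml` favors the more sustainable one; a scalarized score like this is the simplest possible stand-in for a learned search over placements.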

EVOREFUSE: Evolutionary Prompt Optimization for Evaluation and Mitigation of LLM Over-Refusal to Pseudo-Malicious Instructions

Abstract

arXiv:2505.23473v1 Announce Type: new Abstract: Large language models (LLMs) frequently refuse to respond to pseudo-malicious instructions: semantically harmless input queries triggering unnecessary LLM refusals due to conservative safety alignment, significantly impairing user experience. Collecting such instructions is crucial for evaluating and mitigating over-refusals, but existing instruction curation methods, like manual creation or instruction rewriting, either lack scalability or fail to produce sufficiently diverse and effective refusal-inducing prompts. To address these limitations, we introduce EVOREFUSE, a prompt optimization approach that generates diverse pseudo-malicious instructions consistently eliciting confident refusals across LLMs. EVOREFUSE employs an evolutionary algorithm exploring the instruction space in more diverse directions than existing methods via mutation strategies and recombination, and iteratively evolves seed instructions to maximize evidence lower bound on LLM refusal probability. Using EVOREFUSE, we create two novel datasets: EVOREFUSE-TEST, a benchmark of 582 pseudo-malicious instructions that outperforms the next-best benchmark with 140.41% higher average refusal triggering rate across 9 LLMs, 34.86% greater lexical diversity, and 40.03% improved LLM response confidence scores; and EVOREFUSE-ALIGN, which provides 3,000 pseudo-malicious instructions with responses for supervised and preference-based alignment training. LLAMA3.1-8B-INSTRUCT supervisedly fine-tuned on EVOREFUSE-ALIGN achieves up to 14.31% fewer over-refusals than models trained on the second-best alignment dataset, without compromising safety. Our analysis with EVOREFUSE-TEST reveals models trigger over-refusals by overly focusing on sensitive keywords while ignoring broader context.

摘要

大语言模型（LLMs）经常拒绝响应伪恶意指令：这些语义无害的输入查询因保守的安全对齐策略触发了不必要的模型拒绝，显著损害了用户体验。收集此类指令对于评估和缓解过度拒绝行为至关重要，但现有的指令构建方法（如人工创建或指令改写）要么缺乏可扩展性，要么无法生成足够多样且有效的诱发拒绝提示。为解决这些局限性，我们提出了EVOREFUSE——一种提示优化方法，能生成多样化的伪恶意指令，持续在不同LLMs中引发高置信度的拒绝。EVOREFUSE采用进化算法，通过突变策略和重组在指令空间中进行比现有方法更多样化的探索，并迭代优化种子指令以最大化模型拒绝概率的证据下界。基于EVOREFUSE，我们创建了两个新数据集：EVOREFUSE-TEST（包含582条伪恶意指令的基准测试集，在9个LLMs上的平均拒绝触发率比次优基准高140.41%，词汇多样性提升34.86%，模型响应置信度得分提高40.03%）和EVOREFUSE-ALIGN（提供3,000条带标注响应的伪恶意指令，用于监督式和基于偏好的对齐训练）。在EVOREFUSE-ALIGN上监督微调的LLAMA3.1-8B-INSTRUCT模型，其过度拒绝现象比使用次优对齐数据集训练的模型减少达14.31%，且不影响安全性。我们通过EVOREFUSE-TEST分析发现，模型过度拒绝的机制源于对敏感关键词的过度关注而忽略整体上下文。

CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring

Abstract

arXiv:2505.23575v1 Announce Type: new Abstract: As AI models are deployed with increasing autonomy, it is important to ensure they do not take harmful actions unnoticed. As a potential mitigation, we investigate Chain-of-Thought (CoT) monitoring, wherein a weaker trusted monitor model continuously oversees the intermediate reasoning steps of a more powerful but untrusted model. We compare CoT monitoring to action-only monitoring, where only final outputs are reviewed, in a red-teaming setup where the untrusted model is instructed to pursue harmful side tasks while completing a coding problem. We find that CoT monitoring improves detection by up to 27 percentage points in scenarios where action-only monitoring fails to reliably identify sabotage. However, CoT traces can also contain misleading rationalizations that deceive the monitor, reducing performance in more obvious sabotage cases. To address this, we introduce a hybrid protocol that independently scores both reasoning and final outputs and combines them using a weighted average. This hybrid monitor consistently outperforms both CoT and action-only monitors across all tested models and tasks, with detection rates over four times higher than action-only monitoring for subtle deception scenarios.

摘要

随着人工智能模型自主性日益增强，确保其不会在无人察觉的情况下采取有害行为变得至关重要。作为一种潜在解决方案，我们研究了思维链（CoT）监控方法——该方法通过一个较弱但可信的监控模型持续监督更强大但不可信模型的中间推理步骤。在红队测试环境中（不可信模型被指示在完成编码任务时执行有害的次要任务），我们将CoT监控与仅审查最终输出的行为监控进行比较。研究发现，在行为监控无法可靠识别破坏行为的场景中，CoT监控可将检测率提升高达27个百分点。然而，思维链轨迹也可能包含误导性合理化解释以欺骗监控器，导致在更明显的破坏案例中性能下降。为此，我们提出一种混合协议：独立评分推理过程与最终输出，并通过加权平均进行综合。这种混合监控器在所有测试模型和任务中均持续优于纯CoT和行为监控，对于隐蔽欺骗场景的检测率比纯行为监控高出四倍以上。
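
The hybrid protocol scores the reasoning trace and the final output independently and combines the two with a weighted average. A minimal sketch of that combination follows, assuming both monitors emit suspicion scores in [0, 1]; the weight `w_cot` and the `threshold` are hypothetical values, not the ones used in the paper.

```python
from dataclasses import dataclass


@dataclass
class MonitorScores:
    cot: float     # suspicion score for the reasoning trace, in [0, 1]
    action: float  # suspicion score for the final output, in [0, 1]


def hybrid_score(scores: MonitorScores, w_cot: float = 0.5) -> float:
    """Weighted average of the two independently produced suspicion scores."""
    if not 0.0 <= w_cot <= 1.0:
        raise ValueError("w_cot must lie in [0, 1]")
    return w_cot * scores.cot + (1.0 - w_cot) * scores.action


def flag(scores: MonitorScores, w_cot: float = 0.5, threshold: float = 0.5) -> bool:
    """Flag a transcript for review when the combined score reaches the threshold."""
    return hybrid_score(scores, w_cot) >= threshold
```

Because the two scores are produced separately, a misleading rationalization in the reasoning trace can lower only the `cot` term; the `action` term still reflects what the model actually did.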

TRAP: Targeted Redirecting of Agentic Preferences

Abstract

arXiv:2505.23518v1 Announce Type: new Abstract: Autonomous agentic AI systems powered by vision-language models (VLMs) are rapidly advancing toward real-world deployment, yet their cross-modal reasoning capabilities introduce new attack surfaces for adversarial manipulation that exploit semantic reasoning across modalities. Existing adversarial attacks typically rely on visible pixel perturbations or require privileged model or environment access, making them impractical for stealthy, real-world exploitation. We introduce TRAP, a generative adversarial framework that manipulates the agent's decision-making using diffusion-based semantic injections. Our method combines negative prompt-based degradation with positive semantic optimization, guided by a Siamese semantic network and layout-aware spatial masking. Without requiring access to model internals, TRAP produces visually natural images yet induces consistent selection biases in agentic AI systems. We evaluate TRAP on the Microsoft Common Objects in Context (COCO) dataset, building multi-candidate decision scenarios. Across these scenarios, TRAP achieves a 100% attack success rate on leading models, including LLaVA-34B, Gemma3, and Mistral-3.1, significantly outperforming baselines such as SPSA, Bandit, and standard diffusion approaches. These results expose a critical vulnerability: Autonomous agents can be consistently misled through human-imperceptible cross-modal manipulations. These findings highlight the need for defense strategies beyond pixel-level robustness to address semantic vulnerabilities in cross-modal decision-making.

摘要

基于视觉语言模型（VLMs）的自主智能AI系统正迅速向现实世界部署迈进，但其跨模态推理能力引入了新的对抗性操纵攻击面，这些攻击利用跨模态的语义推理。现有的对抗攻击通常依赖于可见像素扰动或需要特权模型或环境访问权限，这使得它们在隐蔽的实际应用中不切实际。我们提出了TRAP，一种生成对抗框架，通过基于扩散的语义注入来操纵智能体的决策。我们的方法结合了基于负面提示的退化与正面语义优化，并由连体语义网络和布局感知的空间掩码指导。TRAP无需访问模型内部，即可生成视觉上自然的图像，同时在自主AI系统中诱导出一致的选择偏差。我们在微软通用上下文对象（COCO）数据集上评估TRAP，构建多候选决策场景。在这些场景中，TRAP在包括LLaVA-34B、Gemma3和Mistral-3.1在内的领先模型上实现了100%的攻击成功率，显著优于SPSA、Bandit和标准扩散方法等基线。这些结果揭示了一个关键漏洞：自主智能体可以通过人类难以察觉的跨模态操纵被持续误导。这些发现强调，需要超越像素级鲁棒性的防御策略，以应对跨模态决策中的语义漏洞。

SafeScientist: Toward Risk-Aware Scientific Discoveries by LLM Agents

Abstract

arXiv:2505.23559v1 Announce Type: new Abstract: Recent advancements in large language model (LLM) agents have significantly accelerated scientific discovery automation, yet concurrently raised critical ethical and safety concerns. To systematically address these challenges, we introduce SafeScientist, an innovative AI scientist framework explicitly designed to enhance safety and ethical responsibility in AI-driven scientific exploration. SafeScientist proactively refuses ethically inappropriate or high-risk tasks and rigorously emphasizes safety throughout the research process. To achieve comprehensive safety oversight, we integrate multiple defensive mechanisms, including prompt monitoring, agent-collaboration monitoring, tool-use monitoring, and an ethical reviewer component. Complementing SafeScientist, we propose SciSafetyBench, a novel benchmark specifically designed to evaluate AI safety in scientific contexts, comprising 240 high-risk scientific tasks across 6 domains, alongside 30 specially designed scientific tools and 120 tool-related risk tasks. Extensive experiments demonstrate that SafeScientist significantly improves safety performance by 35% compared to traditional AI scientist frameworks, without compromising scientific output quality. Additionally, we rigorously validate the robustness of our safety pipeline against diverse adversarial attack methods, further confirming the effectiveness of our integrated approach. The code and data will be available at https://github.com/ulab-uiuc/SafeScientist. Warning: this paper contains example data that may be offensive or harmful.

摘要

尽管大语言模型（LLM）智能体的最新进展显著加速了科学发现的自动化进程，但同时也引发了关键的伦理与安全问题。为系统应对这些挑战，我们提出SafeScientist——一个创新性AI科学家框架，其核心设计目标在于增强AI驱动科学探索的安全性与伦理责任。该框架能主动拒绝伦理失当或高风险任务，并在研究全流程严格贯彻安全准则。为实现全面安全监管，我们整合了多重防御机制，包括提示监控、智能体协作监控、工具使用监控以及伦理审查组件。作为配套评估体系，我们提出SciSafetyBench基准测试，专门用于评估科学场景下的AI安全性，包含6大领域的240项高风险科学任务、30种特制科学工具及120项工具相关风险任务。大量实验表明，与传统AI科学家框架相比，SafeScientist在保持科研成果质量的同时，将安全性能提升35%。此外，我们针对多种对抗攻击方法严格验证了安全管道的鲁棒性，进一步证实了集成方案的有效性。代码与数据将在https://github.com/ulab-uiuc/SafeScientist公开。警告：本文包含可能具有冒犯性或危害性的示例数据。

Fortune: Formula-Driven Reinforcement Learning for Symbolic Table Reasoning in Language Models

Abstract

arXiv:2505.23667v1 Announce Type: new Abstract: Tables are a fundamental structure for organizing and analyzing data, making effective table understanding a critical capability for intelligent systems. While large language models (LMs) demonstrate strong general reasoning abilities, they continue to struggle with accurate numerical or symbolic reasoning over tabular data, especially in complex scenarios. Spreadsheet formulas provide a powerful and expressive medium for representing executable symbolic operations, encoding rich reasoning patterns that remain largely underutilized. In this paper, we propose Formula Tuning (Fortune), a reinforcement learning (RL) framework that trains LMs to generate executable spreadsheet formulas for question answering over general tabular data. Formula Tuning reduces the reliance on supervised formula annotations by using binary answer correctness as a reward signal, guiding the model to learn formula derivation through reasoning. We provide a theoretical analysis of its advantages and demonstrate its effectiveness through extensive experiments on seven table reasoning benchmarks. Formula Tuning substantially enhances LM performance, particularly on multi-step numerical and symbolic reasoning tasks, enabling a 7B model to outperform O1 on table understanding. This highlights the potential of formula-driven RL to advance symbolic table reasoning in LMs.

摘要

表格是组织与分析数据的基础结构，使有效的表格理解成为智能系统的关键能力。尽管大语言模型（LMs）展现出强大的通用推理能力，但在处理复杂场景下表格数据的精确数值或符号推理时仍存在困难。电子表格公式作为一种强大且富有表达力的媒介，能够表征可执行的符号操作，其编码的丰富推理模式目前尚未得到充分利用。本文提出公式调优（Formula Tuning，简称Fortune）——一种基于强化学习（RL）的框架，通过训练LMs生成可执行的电子表格公式来实现通用表格数据的问答。该方法利用二元答案正确性作为奖励信号，减少了对监督式公式标注的依赖，引导模型通过推理学习公式推导过程。我们通过理论分析阐明了其优势，并在七个表格推理基准测试中通过大量实验验证了其有效性。公式调优显著提升了LMs的性能，尤其在多步数值与符号推理任务上，使一个70亿参数的模型在表格理解任务中超越O1模型。这一成果凸显了公式驱动强化学习在推进LMs符号化表格推理方面的潜力。
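
Using binary answer correctness as the reward means a generated formula is judged only by whether its executed result matches the reference answer. The sketch below illustrates that idea under stated assumptions: `binary_reward`, `normalize`, and `toy_execute` are hypothetical names, and the toy executor handles only `=SUM(...)` over literal numbers, standing in for a real spreadsheet engine.

```python
from typing import Callable


def normalize(value: object) -> str:
    """Canonicalize an answer so that 3, 3.0 and ' 3 ' compare equal."""
    text = str(value).strip().lower()
    try:
        return f"{float(text):.6g}"
    except ValueError:
        return text


def binary_reward(formula: str, gold: object,
                  execute: Callable[[str], object]) -> float:
    """Return 1.0 if executing the formula reproduces the gold answer, else 0.0."""
    try:
        result = execute(formula)
    except Exception:
        return 0.0  # formulas that fail to execute earn no reward
    return 1.0 if normalize(result) == normalize(gold) else 0.0


def toy_execute(formula: str) -> float:
    # Hypothetical stand-in: handles only "=SUM(n1, n2, ...)" with literal numbers.
    if not (formula.startswith("=SUM(") and formula.endswith(")")):
        raise ValueError("unsupported formula")
    return sum(float(x) for x in formula[5:-1].split(","))
```

No reference formula appears anywhere in the reward, which is why this kind of signal reduces the need for supervised formula annotations: any formula that yields the right answer is rewarded equally.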

Data-to-Dashboard: Multi-Agent LLM Framework for Insightful Visualization in Enterprise Analytics

Abstract

arXiv:2505.23695v1 Announce Type: new Abstract: The rapid advancement of LLMs has led to the creation of diverse agentic systems in data analysis, utilizing LLMs' capabilities to improve insight generation and visualization. In this paper, we present an agentic system that automates the data-to-dashboard pipeline through modular LLM agents capable of domain detection, concept extraction, multi-perspective analysis generation, and iterative self-reflection. Unlike existing chart QA systems, our framework simulates the analytical reasoning process of business analysts by retrieving domain-relevant knowledge and adapting to diverse datasets without relying on closed ontologies or question templates. We evaluate our system on three datasets across different domains. Benchmarked against GPT-4o with a single-prompt baseline, our approach shows improved insightfulness, domain relevance, and analytical depth, as measured by tailored evaluation metrics and qualitative human assessment. This work contributes a novel modular pipeline to bridge the path from raw data to visualization, and opens new opportunities for human-in-the-loop validation by domain experts in business analytics. All code can be found here: https://github.com/77luvC/D2D_Data2Dashboard

摘要

大型语言模型（LLMs）的快速发展催生了数据分析领域多样化的智能体系统，这些系统利用LLMs的能力提升洞察生成与可视化水平。本文提出一种通过模块化LLM智能体实现数据到仪表盘全流程自动化的智能系统，其具备领域检测、概念提取、多视角分析生成及迭代自反思能力。与现有图表问答系统不同，本框架通过检索领域相关知识并适应多样化数据集，模拟了商业分析师的推理过程，且无需依赖封闭本体论或问题模板。

我们在三个不同领域的数据集上评估系统性能。相较于采用单提示基准的GPT-4o，经定制化评估指标和人工定性测评，本方法在洞察深度、领域相关性和分析层次上均表现出显著提升。

本研究贡献了一种连接原始数据与可视化呈现的新型模块化流程，为商业分析领域专家参与人机协同验证开辟了新途径。全部代码参见：https://github.com/77luvC/D2D_Data2Dashboard

ZeroGUI: Automating Online GUI Learning at Zero Human Cost

Abstract

arXiv:2505.23762v1 Announce Type: new Abstract: The rapid advancement of large Vision-Language Models (VLMs) has propelled the development of pure-vision-based GUI Agents, capable of perceiving and operating Graphical User Interfaces (GUI) to autonomously fulfill user instructions. However, existing approaches usually adopt an offline learning framework, which faces two core limitations: (1) heavy reliance on high-quality manual annotations for element grounding and action supervision, and (2) limited adaptability to dynamic and interactive environments. To address these limitations, we propose ZeroGUI, a scalable, online learning framework for automating GUI Agent training at Zero human cost. Specifically, ZeroGUI integrates (i) VLM-based automatic task generation to produce diverse training goals from the current environment state, (ii) VLM-based automatic reward estimation to assess task success without hand-crafted evaluation functions, and (iii) two-stage online reinforcement learning to continuously interact with and learn from GUI environments. Experiments on two advanced GUI Agents (UI-TARS and Aguvis) demonstrate that ZeroGUI significantly boosts performance across OSWorld and AndroidLab environments. The code is available at https://github.com/OpenGVLab/ZeroGUI.

摘要

大型视觉语言模型（VLM）的快速发展推动了纯视觉GUI代理的进步，这些代理能够感知并操作图形用户界面（GUI），以自主完成用户指令。然而，现有方法通常采用离线学习框架，存在两个核心局限：（1）高度依赖高质量人工标注进行元素定位和动作监督；（2）对动态交互环境的适应性有限。为解决这些问题，我们提出ZeroGUI——一种可扩展的在线学习框架，能够在零人工成本下自动化训练GUI代理。具体而言，ZeroGUI整合了：（i）基于VLM的自动任务生成，从当前环境状态产生多样化训练目标；（ii）基于VLM的自动奖励评估，无需人工设计评价函数即可判断任务成功与否；（iii）两阶段在线强化学习，持续与GUI环境交互并从中学习。在两个先进GUI代理（UI-TARS和Aguvis）上的实验表明，ZeroGUI在OSWorld和AndroidLab环境中显著提升了性能。代码已开源：https://github.com/OpenGVLab/ZeroGUI。

Let's Reason Formally: Natural-Formal Hybrid Reasoning Enhances LLM's Math Capability

Abstract

arXiv:2505.23703v1 Announce Type: new Abstract: Enhancing the mathematical reasoning capabilities of LLMs has garnered significant attention in both the mathematical and computer science communities. Recent works have made substantial progress in both Natural Language (NL) reasoning and Formal Language (FL) reasoning by leveraging the potential of pure Reinforcement Learning (RL) methods on base models. However, RL approaches struggle to impart new capabilities not presented in the base model, highlighting the need to integrate more knowledge like FL into NL math reasoning effectively. Yet, this integration is challenging due to inherent disparities in problem structure and reasoning format between NL and FL. To address these challenges, we introduce NL-FL HybridReasoning, an end-to-end framework designed to incorporate the FL expert into NL math problem-solving. To bridge the NL and FL input format gap, we propose the NL-FL Problem Alignment method, which reformulates the Question-Answering (QA) problems in NL as existence theorems in FL. Subsequently, the Mixed Problem Input technique we provide enables the FL reasoner to handle both QA and existence problems concurrently. Lastly, we mitigate the NL and FL output format gap in reasoning through an LLM-based Answer Extraction mechanism. Comprehensive experiments demonstrate that the HybridReasoning framework achieves 89.80% and 84.34% accuracy rates on the MATH-500 and the AMC benchmarks, surpassing the NL baseline by 4.60% and 4.82%, respectively. Notably, some problems resolved by our framework remain unsolved by the NL baseline model even under a larger number of trials.

摘要

提升大语言模型（LLMs）的数学推理能力已成为数学与计算机科学领域的重要研究方向。近期工作通过基于纯强化学习（RL）方法在基础模型上的应用，在自然语言（NL）推理和形式语言（FL）推理方面取得显著进展。然而，RL方法难以赋予基础模型未展现的新能力，这凸显了将FL等知识有效整合至NL数学推理的必要性。但由于NL与FL在问题结构和推理形式上存在固有差异，这种整合具有挑战性。为此，我们提出NL-FL混合推理框架，该端到端框架旨在将FL专家系统融入NL数学问题求解过程。为弥合NL与FL输入格式差异，我们提出NL-FL问题对齐方法，将NL中的问答（QA）问题重构为FL中的存在性定理。随后，通过混合问题输入技术使FL推理器能同步处理QA与存在性证明问题。最后，我们采用基于LLM的答案提取机制来消除推理过程中NL与FL输出格式的差异。综合实验表明，混合推理框架在MATH-500和AMC基准测试中分别达到89.80%和84.34%的准确率，较NL基线模型提升4.60%和4.82%。值得注意的是，本框架解决的若干问题即便在增加尝试次数后仍无法被NL基线模型攻克。

In Dialogue with Intelligence: Rethinking Large Language Models as Collective Knowledge

Abstract

arXiv:2505.22767v1 Announce Type: cross Abstract: Large Language Models (LLMs) are typically analysed through architectural, behavioural, or training-data lenses. This article offers a theoretical and experiential re-framing: LLMs as dynamic instantiations of Collective human Knowledge (CK), where intelligence is evoked through dialogue rather than stored statically. Drawing on concepts from neuroscience and AI, and grounded in sustained interaction with ChatGPT-4, I examine emergent dialogue patterns, the implications of fine-tuning, and the notion of co-augmentation: mutual enhancement between human and machine cognition. This perspective offers a new lens for understanding interaction, representation, and agency in contemporary AI systems.

摘要

大型语言模型（LLMs）通常通过架构、行为或训练数据的视角进行分析。本文提出一种理论与经验相结合的重构框架：将LLMs视为人类集体知识（CK）的动态具现化，其智能通过对话被激发而非静态存储。借鉴神经科学与人工智能的概念，并基于与ChatGPT-4的持续交互实践，本研究探讨了对话模式的涌现现象、微调技术的深层影响以及协同增强机制——即人类认知与机器智能的相互提升。该视角为理解当代AI系统中的交互模式、知识表征与智能代理提供了新的理论透镜。

Pre-Training Curriculum for Multi-Token Prediction in Language Models

Abstract

arXiv:2505.22757v1 Announce Type: cross Abstract: Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.

摘要

多令牌预测（MTP）是近期提出的一种语言模型预训练目标。与仅预测下一个令牌（NTP）不同，MTP通过多个预测头在每个预测步骤中同时预测后续 $k$ 个令牌。研究表明，MTP在提升下游任务性能、推理速度和训练效率方面具有潜力，尤其适用于大型模型。然而，已有工作发现较小规模语言模型（SLM）难以有效适应MTP目标。为此，我们提出一种MTP训练的课程学习策略，探索两种变体：正向课程（从NTP逐步过渡到MTP以增加目标复杂度）与逆向课程（采用相反路径）。实验表明，正向课程能使SLM在预训练阶段更好地利用MTP目标，既提升下游NTP性能与生成输出质量，又保留自推测解码的优势；而逆向课程虽能获得更强的NTP性能与输出质量，却无法提供任何自推测解码的收益。
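
The forward and reverse curricula differ only in the direction in which the number of predicted tokens changes during pre-training. The following sketch shows one way to express such a schedule, assuming equal-length stages that step through k = 1..`k_max`; the staging rule and the function name `active_heads` are assumptions, since the abstract does not specify the schedule.

```python
def active_heads(step: int, total_steps: int, k_max: int, forward: bool = True) -> int:
    """Number of future tokens predicted at `step` under a staged curriculum."""
    if total_steps <= 0 or k_max < 1:
        raise ValueError("total_steps must be positive and k_max at least 1")
    step = min(max(step, 0), total_steps - 1)  # clamp to the training range
    stage = step * k_max // total_steps        # 0 .. k_max - 1, equal-length stages
    # Forward: start at NTP (k = 1) and grow to k_max; reverse: the opposite.
    return stage + 1 if forward else k_max - stage
```

With `k_max = 4` over 100 steps, the forward schedule trains on next-token prediction for the first quarter and ends with all four heads active, whereas the reverse schedule ends on plain next-token prediction.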

Training Language Models to Generate Quality Code with Program Analysis Feedback

Abstract

arXiv:2505.22704v1 Announce Type: cross Abstract: Code generation with large language models (LLMs), often termed vibe coding, is increasingly adopted in production but fails to ensure code quality, particularly in security (e.g., SQL injection vulnerabilities) and maintainability (e.g., missing type annotations). Existing methods, such as supervised fine-tuning and rule-based post-processing, rely on labor-intensive annotations or brittle heuristics, limiting their scalability and effectiveness. We propose REAL, a reinforcement learning framework that incentivizes LLMs to generate production-quality code using program analysis-guided feedback. Specifically, REAL integrates two automated signals: (1) program analysis detecting security or maintainability defects and (2) unit tests ensuring functional correctness. Unlike prior work, our framework is prompt-agnostic and reference-free, enabling scalable supervision without manual intervention. Experiments across multiple datasets and model scales demonstrate that REAL outperforms state-of-the-art methods in simultaneous assessments of functionality and code quality. Our work bridges the gap between rapid prototyping and production-ready code, enabling LLMs to deliver both speed and quality.

摘要

基于大语言模型（LLMs）的代码生成（常称为氛围编程）正逐渐被应用于实际生产，但其难以保障代码质量，尤其是在安全性（如SQL注入漏洞）和可维护性（如缺失类型注解）方面。现有方法如监督微调和基于规则的后处理依赖于人工密集型标注或脆弱的启发式规则，限制了其扩展性和有效性。我们提出REAL框架——一种通过程序分析引导反馈的强化学习范式，激励LLMs生成生产级质量代码。具体而言，REAL整合了两类自动化信号：（1）检测安全或可维护性缺陷的程序分析；（2）确保功能正确性的单元测试。与先前工作不同，本框架具备提示无关性和无参考性特征，可在无需人工干预的情况下实现可扩展的监督。跨多数据集和模型规模的实验表明，REAL在功能性与代码质量的同步评估中优于现有最优方法。本研究弥合了快速原型开发与生产就绪代码之间的鸿沟，使LLMs能够兼顾效率与质量。
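
REAL combines two automated signals: program analysis for quality defects and unit tests for functional correctness. The sketch below illustrates the shape of such a reward under stated assumptions: the only "analysis" is a check for missing type annotations built on Python's standard `ast` module, and the blending rule and `quality_weight` are hypothetical rather than taken from the paper.

```python
import ast


def missing_annotations(source: str) -> int:
    """Count function definitions that lack a return or parameter annotation."""
    count = 0
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = node.args.posonlyargs + node.args.args + node.args.kwonlyargs
            if node.returns is None or any(a.annotation is None for a in args):
                count += 1
    return count


def reward(source: str, tests_passed: int, tests_total: int,
           quality_weight: float = 0.5) -> float:
    """Blend a unit-test pass rate with a defect-free indicator."""
    functional = tests_passed / tests_total if tests_total else 0.0
    try:
        quality = 1.0 if missing_annotations(source) == 0 else 0.0
    except SyntaxError:
        return 0.0  # unparsable code gets no reward
    return (1.0 - quality_weight) * functional + quality_weight * quality
```

Neither signal needs a reference solution, which matches the reference-free property claimed in the abstract; a security-oriented analyzer could be substituted for the annotation check without changing the structure.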

A Tool for Generating Exceptional Behavior Tests With Large Language Models

Abstract

arXiv:2505.22818v1 Announce Type: cross Abstract: Exceptional behavior tests (EBTs) are crucial in software development for verifying that code correctly handles unwanted events and throws appropriate exceptions. However, prior research has shown that developers often prioritize testing "happy paths", e.g., paths without unwanted events, over exceptional scenarios. We present exLong, a framework that automatically generates EBTs to address this gap. exLong leverages a large language model (LLM) fine-tuned from CodeLlama and incorporates reasoning about exception-throwing traces, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. Our demonstration video illustrates how exLong can effectively assist developers in creating comprehensive EBTs for their project (available at https://youtu.be/Jro8kMgplZk).

摘要

异常行为测试（EBT）在软件开发中至关重要，用于验证代码能否正确处理非预期事件并抛出适当异常。然而，先前研究表明开发者往往优先测试"正常路径"（即无异常事件的路径）而非异常场景。为此，我们提出exLong框架，通过自动生成EBT来解决这一缺陷。exLong基于CodeLlama微调的大语言模型（LLM），整合了异常抛出轨迹推理、守卫throw语句的条件表达式，以及执行相似轨迹的非异常行为测试。我们的演示视频展现了exLong如何有效协助开发者为其项目创建全面的EBT（详见https://youtu.be/Jro8kMgplZk）。
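
To make the distinction between a happy-path test and an exceptional behavior test concrete, here is a small hand-written illustration using Python's standard `unittest` module. It is not output from exLong, and the `withdraw` function is a hypothetical example chosen only because it contains a guard condition in front of a raise statement.

```python
import unittest


def withdraw(balance: float, amount: float) -> float:
    """Hypothetical function under test."""
    if amount > balance:  # conditional expression guarding the raise statement
        raise ValueError("insufficient funds")
    return balance - amount


class WithdrawTests(unittest.TestCase):
    def test_happy_path(self) -> None:
        # Non-exceptional behavior test: the guard is not triggered.
        self.assertEqual(withdraw(100.0, 30.0), 70.0)

    def test_overdraft_raises(self) -> None:
        # Exceptional behavior test: inputs chosen to satisfy the guard.
        with self.assertRaises(ValueError):
            withdraw(100.0, 130.0)
```

The second test exercises the same call as the first but with inputs that satisfy the guard, which is the relationship between non-exceptional tests and EBTs that the abstract says exLong reasons about.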

Automated Essay Scoring Incorporating Annotations from Automated Feedback Systems

Abstract

arXiv:2505.22771v1 Announce Type: cross Abstract: This study illustrates how incorporating feedback-oriented annotations into the scoring pipeline can enhance the accuracy of automated essay scoring (AES). This approach is demonstrated with the Persuasive Essays for Rating, Selecting, and Understanding Argumentative and Discourse Elements (PERSUADE) corpus. We integrate two types of feedback-driven annotations: those that identify spelling and grammatical errors, and those that highlight argumentative components. To illustrate how this method could be applied in real-world scenarios, we employ two LLMs to generate annotations -- a generative language model used for spell-correction and an encoder-based token classifier trained to identify and mark argumentative elements. By incorporating annotations into the scoring process, we demonstrate improvements in performance using encoder-based large language models fine-tuned as classifiers.

摘要

本研究阐述了在评分流程中引入以反馈为导向的标注如何提升自动作文评分（AES）的准确性。该方法通过"用于评分、筛选及理解议论文和语篇要素的说服性论文"（PERSUADE）语料库得到验证。我们整合了两种反馈驱动的标注类型：识别拼写和语法错误的标注，以及突出论证要素的标注。为展示该方法在实际场景中的应用，我们采用两种大语言模型生成标注——一个用于拼写校正的生成式语言模型，和一个经过训练能识别并标记论证要素的基于编码器的标记分类器。通过将标注纳入评分过程，我们证明了使用基于编码器的大语言模型作为微调分类器时性能的提升。

First Steps Towards Overhearing LLM Agents: A Case Study With Dungeons & Dragons Gameplay

Abstract

arXiv:2505.22809v1 Announce Type: cross Abstract: Much work has been done on conversational LLM agents which directly assist human users with tasks. We present an alternative paradigm for interacting with LLM agents, which we call "overhearing agents". These overhearing agents do not actively participate in conversation -- instead, they "listen in" on human-to-human conversations and perform background tasks or provide suggestions to assist the user. In this work, we explore the overhearing agents paradigm through the lens of Dungeons & Dragons gameplay. We present an in-depth study using large multimodal audio-language models as overhearing agents to assist a Dungeon Master. We perform a human evaluation to examine the helpfulness of such agents and find that some large audio-language models have the emergent ability to perform overhearing agent tasks using implicit audio cues. Finally, we release Python libraries and our project code to support further research into the overhearing agents paradigm at https://github.com/zhudotexe/overhearing_agents.

摘要

关于直接协助人类用户完成任务的对话式大语言模型（LLM）代理的研究已有大量成果。我们提出了一种与LLM代理交互的替代范式，称为"旁听代理"。这类代理并不主动参与对话，而是通过"监听"人与人之间的交谈来执行后台任务或提供辅助建议。本研究以《龙与地下城》游戏为场景探索旁听代理范式，采用大型多模态音频-语言模型作为辅助地下城主持人的旁听代理进行深入分析。通过人工评估检验此类代理的实用性，我们发现部分大型音频-语言模型具备利用隐含音频线索执行旁听代理任务的涌现能力。最后，我们在https://github.com/zhudotexe/overhearing_agents发布了支持旁听代理范式进一步研究的Python库及项目代码。

Dynamic Task Adaptation for Multi-Robot Manufacturing Systems with Large Language Models

Abstract

arXiv:2505.22804v1 Announce Type: cross Abstract: Recent manufacturing systems are increasingly adopting multi-robot collaboration to handle complex and dynamic environments. While multi-agent architectures support decentralized coordination among robot agents, they often face challenges in enabling real-time adaptability for unexpected disruptions without predefined rules. Recent advances in large language models offer new opportunities for context-aware decision-making to enable adaptive responses to unexpected changes. This paper presents an initial exploratory implementation of a large language model-enabled control framework for dynamic task reassignment in multi-robot manufacturing systems. A central controller agent leverages the large language model's ability to interpret structured robot configuration data and generate valid reassignments in response to robot failures. Experiments in a real-world setup demonstrate high task success rates in recovering from failures, highlighting the potential of this approach to improve adaptability in multi-robot manufacturing systems.

摘要

现代制造系统日益采用多机器人协作以应对复杂动态环境。虽然多智能体架构支持机器人代理间的去中心化协调，但在缺乏预定义规则的情况下，其实现针对意外中断的实时适应性仍面临挑战。大语言模型的最新进展为情境感知决策提供了新机遇，使其能够对突发变化做出自适应响应。本文提出了一种基于大语言模型的控制框架初步探索性实现，用于多机器人制造系统中的动态任务重分配。中央控制器代理利用大语言模型解析结构化机器人配置数据的能力，并针对机器人故障生成有效重分配方案。真实场景下的实验表明，该系统在故障恢复方面具有较高的任务成功率，凸显了该方法在提升多机器人制造系统适应性方面的潜力。

What Has Been Lost with Synthetic Evaluation?

Abstract

arXiv:2505.22830v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly used for data generation. However, creating evaluation benchmarks raises the bar for this emerging paradigm. Benchmarks must target specific phenomena, penalize exploiting shortcuts, and be challenging. Through two case studies, we investigate whether LLMs can meet these demands by generating reasoning-over-text benchmarks and comparing them to those created through careful crowdsourcing. Specifically, we evaluate both the validity and difficulty of LLM-generated versions of two high-quality reading comprehension datasets: CondaQA, which evaluates reasoning about negation, and DROP, which targets reasoning about quantities. We find that prompting LLMs can produce variants of these datasets that are often valid according to the annotation guidelines, at a fraction of the cost of the original crowdsourcing effort. However, we show that they are less challenging for LLMs than their human-authored counterparts. This finding sheds light on what may have been lost by generating evaluation data with LLMs, and calls for critically reassessing the immediate use of this increasingly prevalent approach to benchmark creation.

摘要

大型语言模型（LLMs）正日益被用于数据生成。然而，为这一新兴范式创建评估基准提出了更高要求。基准必须针对特定现象、避免利用捷径取巧，并具备足够挑战性。通过两项案例研究，我们探究LLMs能否通过生成文本推理基准来满足这些需求，并将其与精心设计的众包基准进行比较。具体而言，我们评估了由LLMs生成的两个高质量阅读理解数据集的效度与难度：评估否定推理的CondaQA，以及针对数量推理的DROP。研究发现，通过提示工程，LLMs能以远低于原始众包成本的方式生成符合标注规范的基准变体。然而，这些生成数据对LLMs的挑战性显著低于人工编写版本。这一发现揭示了使用LLMs生成评估数据时可能存在的缺陷，并呼吁对这一日益普遍的基准创建方法进行批判性重估。

Operationalizing CaMeL: Strengthening LLM Defenses for Enterprise Deployment

Abstract

arXiv:2505.22852v1 Announce Type: cross Abstract: CaMeL (Capabilities for Machine Learning) introduces a capability-based sandbox to mitigate prompt injection attacks in large language model (LLM) agents. While effective, CaMeL assumes a trusted user prompt, omits side-channel concerns, and incurs performance tradeoffs due to its dual-LLM design. This response identifies these issues and proposes engineering improvements to expand CaMeL's threat coverage and operational usability. We introduce: (1) prompt screening for initial inputs, (2) output auditing to detect instruction leakage, (3) a tiered-risk access model to balance usability and control, and (4) a verified intermediate language for formal guarantees. Together, these upgrades align CaMeL with best practices in enterprise security and support scalable deployment.

摘要

CaMeL（机器学习能力框架）提出了一种基于能力的沙箱方案，用于缓解大型语言模型（LLM）智能体中的提示注入攻击。虽然有效，但CaMeL假设用户提示可信、忽略了侧信道风险，且其双LLM设计导致性能折衷。本文指出这些问题并提出工程改进方案，以扩展CaMeL的威胁覆盖范围与操作实用性。我们引入：（1）初始输入的提示筛查机制；（2）检测指令泄露的输出审计；（3）平衡可用性与控制的分级风险访问模型；（4）具备形式化验证保证的中间语言。这些升级共同使CaMeL符合企业安全最佳实践，并支持可扩展部署。
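
A tiered-risk access model can be read as a mapping from tool calls to risk tiers, with a policy that decides per tier whether to allow, confirm, or deny. The sketch below is purely illustrative: the tier names, the tool names in `TOOL_RISK`, and the three-way decision are assumptions and are not taken from CaMeL or from the paper.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical tool-to-tier mapping; a real deployment would define its own.
TOOL_RISK = {
    "search_docs": Risk.LOW,
    "send_email": Risk.MEDIUM,
    "transfer_funds": Risk.HIGH,
}


def decide(tool: str, user_clearance: Risk) -> str:
    """Return 'allow', 'confirm', or 'deny' for a requested tool call."""
    risk = TOOL_RISK.get(tool, Risk.HIGH)  # unknown tools default to the strictest tier
    if risk > user_clearance:
        return "deny"
    # High-risk calls within clearance still require explicit confirmation.
    return "confirm" if risk is Risk.HIGH else "allow"
```

Low-risk calls pass without friction while high-risk calls always involve a human step, which is one way to trade usability against control as the abstract describes.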

BugWhisperer: Fine-Tuning LLMs for SoC Hardware Vulnerability Detection

Abstract

arXiv:2505.22878v1 Announce Type: cross Abstract: The current landscape of system-on-chips (SoCs) security verification faces challenges due to manual, labor-intensive, and inflexible methodologies. These issues limit the scalability and effectiveness of security protocols, making bug detection at the Register-Transfer Level (RTL) difficult. This paper proposes a new framework named BugWhisperer that utilizes a specialized, fine-tuned Large Language Model (LLM) to address these challenges. By enhancing the LLM's hardware security knowledge and leveraging its capabilities for text inference and knowledge transfer, this approach automates and improves the adaptability and reusability of the verification process. We introduce an open-source, fine-tuned LLM specifically designed for detecting security vulnerabilities in SoC designs. Our findings demonstrate that this tailored LLM effectively enhances the efficiency and flexibility of the security verification process. Additionally, we introduce a comprehensive hardware vulnerability database that supports this work and will further assist the research community in enhancing the security verification process.

摘要

当前系统级芯片(SoC)安全验证领域面临着手动操作、劳动密集且方法僵化等挑战，这些问题限制了安全协议的可扩展性和有效性，导致寄存器传输级(RTL)的错误检测困难。本文提出名为BugWhisperer的新型框架，通过采用专门微调的大语言模型(LLM)来解决这些问题。该方法通过增强LLM的硬件安全知识并利用其文本推理和知识迁移能力，实现了验证流程的自动化，同时提升了其适应性和可复用性。我们开发了一个专用于检测SoC设计安全漏洞的开源微调LLM。研究结果表明，该定制化LLM能有效提升安全验证流程的效率和灵活性。此外，我们还构建了一个全面的硬件漏洞数据库以支持本项研究，该数据库将进一步助力研究界完善安全验证流程。

Permissioned LLMs: Enforcing Access Control in Large Language Models

Abstract

arXiv:2505.22860v1 Announce Type: cross Abstract: In enterprise settings, organizational data is segregated, siloed and carefully protected by elaborate access control frameworks. These access control structures can completely break down if an LLM fine-tuned on the siloed data serves requests, for downstream tasks, from individuals with disparate access privileges. We propose Permissioned LLMs (PermLLM), a new class of LLMs that superimpose the organizational data access control structures on query responses they generate. We formalize abstractions underpinning the means to determine whether access control enforcement happens correctly over LLM query responses. Our formalism introduces the notion of a relevant response that can be used to prove whether a PermLLM mechanism has been implemented correctly. We also introduce a novel metric, called access advantage, to empirically evaluate the efficacy of a PermLLM mechanism. We introduce three novel PermLLM mechanisms that build on Parameter Efficient Fine-Tuning to achieve the desired access control. We furthermore present two instantiations of access advantage--(i) Domain Distinguishability Index (DDI) based on Membership Inference Attacks, and (ii) Utility Gap Index (UGI) based on LLM utility evaluation. We demonstrate the efficacy of our PermLLM mechanisms through extensive experiments on four public datasets (GPQA, RCV1, SimpleQA, and WMDP), in addition to evaluating the validity of DDI and UGI metrics themselves for quantifying access control in LLMs.

摘要

在企业环境中，组织数据通常被分隔存储、孤立管理，并通过复杂的访问控制框架进行严格保护。当基于孤立数据微调的大语言模型（LLM）为具有不同访问权限的个人提供下游任务服务时，这些访问控制结构可能会完全失效。我们提出权限控制大语言模型（PermLLM），这类新型LLM能在生成的查询响应中叠加组织数据访问控制结构。我们形式化了支撑访问控制在LLM查询响应中正确执行的抽象机制，引入"相关响应"概念用于验证PermLLM机制的正确实现。同时提出名为"访问优势"的新度量标准来实证评估PermLLM机制的有效性。我们基于参数高效微调技术开发了三种新型PermLLM机制以实现目标访问控制，并提出访问优势的两个具体实现：（1）基于成员推理攻击的领域可区分指数（DDI）；（2）基于LLM效用评估的效用差距指数（UGI）。通过在四个公开数据集（GPQA、RCV1、SimpleQA和WMDP）上的大量实验，我们验证了PermLLM机制的有效性，同时评估了DDI和UGI指标在量化LLM访问控制方面的有效性。

Can LLMs Deceive CLIP? Benchmarking Adversarial Compositionality of Pre-trained Multimodal Representation via Text Updates

Abstract

arXiv:2505.22943v1 Announce Type: cross Abstract: While pre-trained multimodal representations (e.g., CLIP) have shown impressive capabilities, they exhibit significant compositional vulnerabilities leading to counterintuitive judgments. We introduce Multimodal Adversarial Compositionality (MAC), a benchmark that leverages large language models (LLMs) to generate deceptive text samples to exploit these vulnerabilities across different modalities and evaluates them through both sample-wise attack success rate and group-wise entropy-based diversity. To improve zero-shot methods, we propose a self-training approach that leverages rejection-sampling fine-tuning with diversity-promoting filtering, which enhances both attack success rate and sample diversity. Using smaller language models like Llama-3.1-8B, our approach demonstrates superior performance in revealing compositional vulnerabilities across various multimodal representations, including images, videos, and audio.

摘要

尽管预训练多模态表征（如CLIP）已展现出卓越性能，但其存在显著的组合脆弱性，导致产生反直觉判断。我们提出多模态对抗组合性基准（MAC），该基准利用大语言模型（LLMs）生成具有欺骗性的文本样本，通过样本级攻击成功率和基于群体熵的多样性评估，揭示不同模态间的脆弱性。为改进零样本方法，我们提出一种自训练策略，结合拒绝采样微调与多样性促进过滤机制，有效提升攻击成功率和样本多样性。采用Llama-3.1-8B等较小语言模型时，本方法在图像、视频及音频等多模态表征中均展现出更优的组合脆弱性揭示能力。

Abstract

arXiv:2505.22939v1 Announce Type: cross Abstract: A key task in certain democratic processes is to produce a concise slate of statements that proportionally represents the full spectrum of user opinions. This task is similar to committee elections, but unlike traditional settings, the candidate set comprises all possible statements of varying lengths, and so it can only be accessed through specific queries. Combining social choice and large language models, prior work has approached this challenge through a framework of generative social choice. We extend the framework in two fundamental ways, providing theoretical guarantees even in the face of approximately optimal queries and a budget limit on the overall length of the slate. Using GPT-4o to implement queries, we showcase our approach on datasets related to city improvement measures and drug reviews, demonstrating its effectiveness in generating representative slates from unstructured user opinions.

摘要

特定民主进程中的关键任务是生成一份简明陈述清单，按比例反映用户意见的全貌。该任务类似于委员会选举，但与传统场景不同，候选集包含所有可能的不同长度陈述，因此只能通过特定查询访问。结合社会选择理论与大型语言模型，先前研究通过生成式社会选择框架应对这一挑战。我们从两个根本维度拓展该框架：即使面对近似最优查询和清单总长度预算限制，仍能提供理论保证。利用GPT-4o实现查询功能，我们在城市改善措施和药物评价数据集上验证了该方法，证明其能从非结构化用户意见中有效生成代表性陈述清单。

HiLDe: Intentional Code Generation via Human-in-the-Loop Decoding

Abstract

arXiv:2505.22906v1 Announce Type: cross Abstract: While AI programming tools hold the promise of increasing programmers' capabilities and productivity to a remarkable degree, they often exclude users from essential decision-making processes, causing many to effectively "turn off their brains" and over-rely on solutions provided by these systems. These behaviors can have severe consequences in critical domains, like software security. We propose Human-in-the-loop Decoding, a novel interaction technique that allows users to observe and directly influence LLM decisions during code generation, in order to align the model's output with their personal requirements. We implement this technique in HiLDe, a code completion assistant that highlights critical decisions made by the LLM and provides local alternatives for the user to explore. In a within-subjects study (N=18) on security-related tasks, we found that HiLDe led participants to generate significantly fewer vulnerabilities and better align code generation with their goals compared to a traditional code completion assistant.

摘要

尽管人工智能编程工具有望显著提升程序员的能力与生产力，但这些工具往往将用户排除在关键决策流程之外，导致许多使用者实质上"停止思考"并过度依赖系统提供的解决方案。在软件安全等关键领域，此类行为可能造成严重后果。我们提出"人在回路解码"这一新型交互技术，使用户能够在代码生成过程中观察并直接影响大语言模型的决策，从而使模型输出符合其个性化需求。基于该技术，我们开发了HiLDe代码补全助手，其可突出显示大语言模型的关键决策点，并提供局部替代方案供用户选择。在一项涉及18名参与者的安全相关任务受试者内研究中发现，与传统代码补全工具相比，HiLDe能显著减少参与者代码中的漏洞数量，并使其代码生成结果更符合预期目标。

WorkForceAgent-R1: Incentivizing Reasoning Capability in LLM-based Web Agents via Reinforcement Learning

Abstract

arXiv:2505.22942v1 Announce Type: cross Abstract: Large language model (LLM)-empowered web agents enable the automation of complex, real-time web navigation tasks in enterprise environments. However, existing web agents relying on supervised fine-tuning (SFT) often struggle with generalization and robustness due to insufficient reasoning capabilities when handling the inherently dynamic nature of web interactions. In this study, we introduce WorkForceAgent-R1, an LLM-based web agent trained using a rule-based R1-style reinforcement learning framework designed explicitly to enhance single-step reasoning and planning for business-oriented web navigation tasks. We employ a structured reward function that evaluates both adherence to output formats and correctness of actions, enabling WorkForceAgent-R1 to implicitly learn robust intermediate reasoning without explicit annotations or extensive expert demonstrations. Extensive experiments on the WorkArena benchmark demonstrate that WorkForceAgent-R1 substantially outperforms SFT baselines by 10.26-16.59%, achieving competitive performance relative to proprietary LLM-based agents (gpt-4o) in workplace-oriented web navigation tasks.

摘要

基于大语言模型（LLMs）的网页代理能够实现企业环境中复杂、实时网页导航任务的自动化。然而，现有依赖监督微调（SFT）的网页代理在处理网页交互固有的动态特性时，由于推理能力不足，往往面临泛化性和鲁棒性挑战。本研究提出WorkForceAgent-R1，这是一种基于LLM的网页代理，采用基于规则的R1风格强化学习框架进行训练，该框架专为增强面向业务网页导航任务的单步推理与规划能力而设计。我们采用结构化奖励函数，同时评估输出格式的合规性和动作的正确性，使WorkForceAgent-R1能够隐式学习鲁棒的中间推理过程，而无需显式标注或大量专家示范。在WorkArena基准上的大量实验表明，WorkForceAgent-R1以10.26-16.59%的显著优势超越SFT基线模型，并在面向工作场景的网页导航任务中达到与专有LLM代理（gpt-4o）相竞争的性能水平。
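The structured reward described in this abstract, which scores both adherence to an output format and correctness of the action, might look roughly like the sketch below. The tag names (`<think>`, `<action>`), the weights, and the exact-match comparison are illustrative assumptions; the abstract does not specify the actual format or weighting.

```python
import re

# Hypothetical output format (an assumption, not taken from the paper):
# reasoning inside <think>...</think>, followed by one <action>...</action>.
_FORMAT = re.compile(r"^<think>(.+?)</think>\s*<action>(.+?)</action>$", re.DOTALL)


def structured_reward(output: str, gold_action: str,
                      w_format: float = 0.2, w_correct: float = 0.8) -> float:
    """Rule-based reward: a format term plus an action-correctness term."""
    match = _FORMAT.match(output.strip())
    if match is None:
        return 0.0  # malformed output earns no reward at all
    reward = w_format  # well-formed output earns the format term
    predicted_action = match.group(2).strip()
    if predicted_action == gold_action.strip():
        reward += w_correct  # exact-match correctness (assumed criterion)
    return reward
```

Because the reward only inspects the final action and the overall layout, the text inside the reasoning tag is never graded directly, which is one way intermediate reasoning could be learned without explicit annotations.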

MermaidFlow: Redefining Agentic Workflow Generation via Safety-Constrained Evolutionary Programming

Abstract

arXiv:2505.22967v1 Announce Type: cross Abstract: Despite the promise of autonomous agentic reasoning, existing workflow generation methods frequently produce fragile, unexecutable plans due to unconstrained LLM-driven construction. We introduce MermaidFlow, a framework that redefines the agentic search space through safety-constrained graph evolution. At its core, MermaidFlow represents workflows as a verifiable intermediate representation using Mermaid, a structured and human-interpretable graph language. We formulate domain-aware evolutionary operators, i.e., crossover, mutation, insertion, and deletion, to preserve semantic correctness while promoting structural diversity, enabling efficient exploration of a high-quality, statically verifiable workflow space. Without modifying task settings or evaluation protocols, MermaidFlow achieves consistent improvements in success rates and faster convergence to executable plans on the agent reasoning benchmark. The experimental results demonstrate that safety-constrained graph evolution offers a scalable, modular foundation for robust and interpretable agentic reasoning systems.

摘要

尽管自主代理推理技术前景广阔，但现有工作流生成方法由于采用无约束的大语言模型驱动构建，常产生脆弱且不可执行的计划。我们提出MermaidFlow框架，该框架通过安全约束的图演化重新定义了代理搜索空间。其核心在于使用结构化且人类可理解的图形语言Mermaid，将工作流表示为可验证的中间表示。我们设计了领域感知的演化算子（包括交叉、变异、插入和删除），在保持语义正确性的同时促进结构多样性，从而实现对高质量、静态可验证工作流空间的高效探索。在不修改任务设置或评估协议的情况下，MermaidFlow在代理推理基准测试中实现了成功率的一致提升，并能更快收敛至可执行计划。实验结果表明，安全约束的图演化为构建鲁棒且可解释的代理推理系统提供了可扩展、模块化的基础。

NegVQA: Can Vision Language Models Understand Negation?

Abstract

arXiv:2505.22946v1 Announce Type: cross Abstract: Negation is a fundamental linguistic phenomenon that can entirely reverse the meaning of a sentence. As vision language models (VLMs) continue to advance and are deployed in high-stakes applications, assessing their ability to comprehend negation becomes essential. To address this, we introduce NegVQA, a visual question answering (VQA) benchmark consisting of 7,379 two-choice questions covering diverse negation scenarios and image-question distributions. We construct NegVQA by leveraging large language models to generate negated versions of questions from existing VQA datasets. Evaluating 20 state-of-the-art VLMs across seven model families, we find that these models struggle significantly with negation, exhibiting a substantial performance drop compared to their responses to the original questions. Furthermore, we uncover a U-shaped scaling trend, where increasing model size initially degrades performance on NegVQA before leading to improvements. Our benchmark reveals critical gaps in VLMs' negation understanding and offers insights into future VLM development. Project page available at https://yuhui-zh15.github.io/NegVQA/.

摘要

否定是一种基本语言现象，能彻底逆转句子的含义。随着视觉语言模型（VLMs）的不断进步及其在高风险应用中的部署，评估其理解否定的能力变得至关重要。为此，我们提出NegVQA——一个包含7,379道二选一问题的视觉问答（VQA）基准测试，涵盖多样化否定场景与图像-问题分布。该数据集通过利用大语言模型对现有VQA数据集中的问题生成否定版本构建而成。通过对7个模型家族的20个前沿VLM进行评估，我们发现这些模型在否定理解上存在显著困难，其表现较原始问题出现大幅下降。此外，我们观察到U型规模扩展趋势：增加模型尺寸会先导致NegVQA性能下降，而后才产生改善。本基准测试揭示了VLMs在否定理解上的关键缺陷，为未来VLM发展提供了重要洞见。项目页面详见https://yuhui-zh15.github.io/NegVQA/。

Scalable Parameter and Memory Efficient Pretraining for LLM: Recent Algorithmic Advances and Benchmarking

Abstract

arXiv:2505.22922v1 Announce Type: cross Abstract: Fueled by their remarkable ability to tackle diverse tasks across multiple domains, large language models (LLMs) have grown at an unprecedented rate, with some recent models containing trillions of parameters. This growth is accompanied by substantial computational challenges, particularly regarding the memory and compute resources required for training and fine-tuning. Numerous approaches have been explored to address these issues, such as LoRA. While these methods are effective for fine-tuning, their application to pre-training is significantly more challenging due to the need to learn vast datasets. Motivated by this issue, we aim to address the following questions: Can parameter- or memory-efficient methods enhance pre-training efficiency while achieving performance comparable to full-model training? How can the performance gap be narrowed? To this end, the contributions of this work are the following. (1) We begin by conducting a comprehensive survey that summarizes state-of-the-art methods for efficient pre-training. (2) We perform a benchmark evaluation of several representative memory efficient pre-training approaches to comprehensively evaluate their performance across model sizes. We observe that with a proper choice of optimizer and hyperparameters, full-rank training delivers the best performance, as expected. We also notice that incorporating high-rank updates in low-rank approaches is the key to improving their performance. (3) Finally, we propose two practical techniques, namely weight refactorization and momentum reset, to enhance the performance of efficient pre-training methods. We observe that applying these techniques to the low-rank method (on a 1B model) can achieve a lower perplexity than popular memory efficient algorithms such as GaLore and Fira, while simultaneously using about 25% less memory.

摘要

得益于大语言模型（LLMs）在跨领域多样化任务中展现出的卓越能力，其规模正以空前速度增长，部分最新模型已包含数万亿参数。这种增长伴随着巨大的计算挑战，尤其是训练与微调过程中对内存和计算资源的需求。尽管已有诸多方法（如LoRA）被探索用于解决这些问题，但这些技术在预训练阶段的应用面临更大挑战，因为预训练需要学习海量数据集。基于此问题，我们旨在探讨以下问题：参数或内存高效方法能否在保持与全模型训练相当性能的同时提升预训练效率？如何缩小性能差距？为此，本研究作出以下贡献：（1）首先对高效预训练的前沿方法进行全面综述；（2）对几种代表性内存高效预训练方法进行基准评估，系统考察其在不同模型规模下的性能表现。我们发现，在优化器和超参数选择适当时，全秩训练如预期般展现出最佳性能。同时注意到，在低秩方法中引入高秩更新是提升性能的关键；（3）最终提出权重重构和动量重置两项实用技术以增强高效预训练方法的性能。实验表明，在10亿参数模型上应用这些技术的低秩方法，其困惑度可低于GaLore、Fira等流行内存高效算法，同时内存占用减少约25%。

Exploring Scaling Laws for EHR Foundation Models

Abstract

arXiv:2505.22964v1 Announce Type: cross Abstract: The emergence of scaling laws has profoundly shaped the development of large language models (LLMs), enabling predictable performance gains through systematic increases in model size, dataset volume, and compute. Yet, these principles remain largely unexplored in the context of electronic health records (EHRs) -- a rich, sequential, and globally abundant data source that differs structurally from natural language. In this work, we present the first empirical investigation of scaling laws for EHR foundation models. By training transformer architectures on patient timeline data from the MIMIC-IV database across varying model sizes and compute budgets, we identify consistent scaling patterns, including parabolic IsoFLOPs curves and power-law relationships between compute, model parameters, data size, and clinical utility. These findings demonstrate that EHR models exhibit scaling behavior analogous to LLMs, offering predictive insights into resource-efficient training strategies. Our results lay the groundwork for developing powerful EHR foundation models capable of transforming clinical prediction tasks and advancing personalized healthcare.

摘要

缩放定律的出现深刻影响了大型语言模型（LLMs）的发展，通过系统性地增加模型规模、数据集体量和计算资源，实现了可预测的性能提升。然而，这些原则在电子健康记录（EHRs）领域仍鲜有探索——EHR作为一种结构上与自然语言不同的、丰富且具时序性的全球海量数据源。本研究首次对EHR基础模型的缩放定律进行实证探究。通过在MIMIC-IV数据库的患者时间线数据上训练不同规模及计算预算的Transformer架构，我们发现了包括抛物线型IsoFLOPs曲线、以及计算量/模型参数量/数据规模与临床效用之间的幂律关系等一致的缩放规律。这些结果表明EHR模型展现出与LLMs类似的缩放行为，为资源高效训练策略提供了预测性见解。我们的研究成果为开发能够革新临床预测任务、推动个性化医疗的强大EHR基础模型奠定了理论基础。
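The parabolic IsoFLOPs curves mentioned in this abstract are commonly analyzed by fitting, for one fixed compute budget, a parabola to loss as a function of log model size and reading off the minimum. The sketch below shows that fitting step in plain Python; the function names and the choice of base-10 logarithms are assumptions for illustration, and the paper's own fitting procedure may differ.

```python
import math
from typing import List, Sequence, Tuple


def _det3(m: List[List[float]]) -> float:
    """Determinant of a 3x3 matrix by cofactor expansion."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_parabola(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float, float]:
    """Least-squares fit of y = a*x^2 + b*x + c via the normal equations."""
    n = float(len(xs))
    s1 = sum(xs)
    s2 = sum(x ** 2 for x in xs)
    s3 = sum(x ** 3 for x in xs)
    s4 = sum(x ** 4 for x in xs)
    t0 = sum(ys)
    t1 = sum(x * y for x, y in zip(xs, ys))
    t2 = sum(x * x * y for x, y in zip(xs, ys))
    m = [[s4, s3, s2], [s3, s2, s1], [s2, s1, n]]
    rhs = [t2, t1, t0]
    det = _det3(m)
    if det == 0:
        raise ValueError("need at least three distinct x values")
    coeffs = []
    for col in range(3):  # Cramer's rule: swap one column for the right-hand side
        mc = [row[:] for row in m]
        for r in range(3):
            mc[r][col] = rhs[r]
        coeffs.append(_det3(mc) / det)
    return coeffs[0], coeffs[1], coeffs[2]


def compute_optimal_size(param_counts: Sequence[float], losses: Sequence[float]) -> float:
    """Model size at the vertex of the parabola fitted in log10(parameters)."""
    xs = [math.log10(p) for p in param_counts]
    a, b, _ = fit_parabola(xs, losses)
    if a <= 0:
        raise ValueError("fitted curve has no minimum")
    return 10 ** (-b / (2 * a))
```

Repeating this fit for several compute budgets gives one optimal size per budget, and a power law can then be fitted through those points.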

Unraveling LoRA Interference: Orthogonal Subspaces for Robust Model Merging

Abstract

arXiv:2505.22934v1 Announce Type: cross Abstract: Fine-tuning large language models (LMs) for individual tasks yields strong performance but is expensive for deployment and storage. Recent works explore model merging to combine multiple task-specific models into a single multi-task model without additional training. However, existing merging methods often fail for models fine-tuned with low-rank adaptation (LoRA), due to significant performance degradation. In this paper, we show that this issue arises from a previously overlooked interplay between model parameters and data distributions. We propose Orthogonal Subspaces for Robust model Merging (OSRM) to constrain the LoRA subspace prior to fine-tuning, ensuring that updates relevant to one task do not adversely shift outputs for others. Our approach can seamlessly integrate with most existing merging algorithms, reducing the unintended interference among tasks. Extensive experiments on eight datasets, tested with three widely used LMs and two large LMs, demonstrate that our method not only boosts merging performance but also preserves single-task accuracy. Furthermore, our approach exhibits greater robustness to the hyperparameters of merging. These results highlight the importance of data-parameter interaction in model merging and offer a plug-and-play solution for merging LoRA models.

摘要

针对单个任务微调大型语言模型（LM）虽能获得强劲性能，但存在部署和存储成本高昂的问题。近期研究探索通过模型融合将多个任务专用模型合并为单一多任务模型，且无需额外训练。然而现有融合方法在应用于低秩自适应（LoRA）微调的模型时，常因性能显著下降而失效。本文揭示该问题源于模型参数与数据分布之间长期被忽视的相互作用。我们提出正交子空间鲁棒模型融合（OSRM）方法，在微调前约束LoRA子空间，确保针对某任务的更新不会对其他任务的输出产生负面偏移。该方法可与多数现有融合算法无缝集成，减少任务间的意外干扰。在八个数据集上的大量实验表明，使用三种广泛使用的LM和两种大型LM测试时，我们的方法不仅能提升融合性能，还能保持单任务准确率。此外，本方案对融合超参数表现出更强的鲁棒性。这些结果揭示了数据-参数交互在模型融合中的重要性，并为LoRA模型融合提供了即插即用的解决方案。

Model-Preserving Adaptive Rounding

Abstract

arXiv:2505.22988v1 Announce Type: cross Abstract: The main goal of post-training quantization (PTQ) is to produce a compressed model whose output distribution is as close to the original model's as possible. To do this tractably, almost all LLM PTQ algorithms quantize linear layers by independently minimizing the immediate activation error. However, this localized objective ignores the effect of subsequent layers, so reducing it does not necessarily give a closer model. In this work, we introduce Yet Another Quantization Algorithm (YAQA), an adaptive rounding algorithm that uses Kronecker-factored approximations of each linear layer's Hessian with respect to the full-model KL divergence. YAQA consists of two components: Kronecker-factored sketches of the full layerwise Hessian that can be tractably computed for hundred-billion parameter LLMs, and a quantizer-independent rounding algorithm that uses these sketches and comes with theoretical guarantees. Across a wide range of models and quantizers, YAQA empirically reduces the KL divergence to the original model by approximately 30% while achieving state-of-the-art performance on downstream tasks.

摘要

训练后量化（PTQ）的主要目标是生成一个压缩模型，其输出分布尽可能接近原始模型。为实现这一目标，几乎所有大语言模型PTQ算法都通过独立最小化即时激活误差来量化线性层。然而，这种局部化目标忽略了后续层的影响，因此降低该误差未必能获得更接近的模型。本研究提出"另一种量化算法"（YAQA），这是一种基于克罗内克分解近似各线性层关于全模型KL散度海森矩阵的自适应舍入算法。YAQA包含两个组件：可对千亿参数大语言模型进行高效计算的全层级海森矩阵克罗内克分解草图，以及利用这些草图并具备理论保证的量化器无关舍入算法。在多种模型和量化器上的实验表明，YAQA能将原始模型的KL散度降低约30%，同时在下游任务中达到最先进性能。

OWL: Probing Cross-Lingual Recall of Memorized Texts via World Literature

Abstract

arXiv:2505.22945v1 Announce Type: cross Abstract: Large language models (LLMs) are known to memorize and recall English text from their pretraining data. However, the extent to which this ability generalizes to non-English languages or transfers across languages remains unclear. This paper investigates multilingual and cross-lingual memorization in LLMs, probing if memorized content in one language (e.g., English) can be recalled when presented in translation. To do so, we introduce OWL, a dataset of 31.5K aligned excerpts from 20 books in ten languages, including English originals, official translations (Vietnamese, Spanish, Turkish), and new translations in six low-resource languages (Sesotho, Yoruba, Maithili, Malagasy, Setswana, Tahitian). We evaluate memorization across model families and sizes through three tasks: (1) direct probing, which asks the model to identify a book's title and author; (2) name cloze, which requires predicting masked character names; and (3) prefix probing, which involves generating continuations. We find that LLMs consistently recall content across languages, even for texts without direct translation in pretraining data. GPT-4o, for example, identifies authors and titles 69% of the time and masked entities 6% of the time in newly translated excerpts. Perturbations (e.g., masking characters, shuffling words) modestly reduce direct probing accuracy (7% drop for shuffled official translations). Our results highlight the extent of cross-lingual memorization and provide insights on the differences between the models.

摘要

已知大型语言模型（LLMs）能够记忆并回忆预训练数据中的英文文本，但其对非英语语言的泛化能力及跨语言迁移程度尚不明确。本文研究LLMs的多语言与跨语言记忆机制，探究模型是否能通过翻译呈现来回忆源语言（如英语）的记忆内容。为此，我们构建了OWL数据集，包含十种语言的20本书籍中3.15万条对齐文本片段，涵盖英语原文、官方译本（越南语、西班牙语、土耳其语）及六种低资源语言（塞索托语、约鲁巴语、迈蒂利语、马尔加什语、茨瓦纳语、塔希提语）的新译本。通过三项任务评估不同模型系列及规模的记忆能力：（1）直接探测：要求模型识别书籍标题与作者；（2）名称填空：预测被遮蔽的人物名称；（3）前缀探测：生成后续文本。研究发现，LLMs能跨语言稳定回忆内容，即使预训练数据中无直接翻译的文本亦如此。例如GPT-4o在新翻译文本中识别作者与标题的准确率达69%，遮蔽实体预测准确率为6%。干扰操作（如遮蔽字符、打乱词序）会适度降低直接探测准确率（官方译本词序打乱后下降7%）。本研究揭示了跨语言记忆的广度，并为模型间的差异性提供了新见解。
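The name cloze task in this abstract, in which the model must predict a masked character name, can be illustrated with a small sketch. The mask token, the decision to mask every occurrence of the name, and the case-insensitive scoring are all assumptions made here; the abstract does not give the dataset's actual construction rules.

```python
import re


def make_name_cloze(passage: str, name: str, mask: str = "[MASK]") -> str:
    """Replace whole-word occurrences of a character name with a mask token."""
    pattern = re.compile(rf"\b{re.escape(name)}\b")
    masked, count = pattern.subn(mask, passage)
    if count == 0:
        raise ValueError(f"name {name!r} does not occur in the passage")
    return masked


def is_correct(prediction: str, name: str) -> bool:
    """Score a prediction by case-insensitive exact match (assumed metric)."""
    return prediction.strip().casefold() == name.casefold()
```

For the cross-lingual setting, the same masking would be applied to the translated excerpt, so a correct answer suggests recall of the underlying book, not just of the English surface text.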

A Practical Approach for Building Production-Grade Conversational Agents with Workflow Graphs

Abstract

arXiv:2505.23006v1 Announce Type: cross Abstract: The advancement of Large Language Models (LLMs) has led to significant improvements in various service domains, including search, recommendation, and chatbot applications. However, applying state-of-the-art (SOTA) research to industrial settings presents challenges, as it requires maintaining flexible conversational abilities while also strictly complying with service-specific constraints. This can be seen as two conflicting requirements due to the probabilistic nature of LLMs. In this paper, we propose our approach to addressing this challenge and detail the strategies we employed to overcome their inherent limitations in real-world applications. We conduct a practical case study of a conversational agent designed for the e-commerce domain, detailing our implementation workflow and optimizations. Our findings provide insights into bridging the gap between academic research and real-world application, introducing a framework for developing scalable, controllable, and reliable AI-driven agents.

摘要

大型语言模型（LLM）的进步显著提升了搜索、推荐和聊天机器人等多个服务领域的性能。然而，将最先进（SOTA）研究成果应用于工业场景面临挑战，因为这需要在保持灵活对话能力的同时，严格遵循特定服务的约束条件。由于LLM的概率性特性，这两项要求可被视为相互冲突。本文提出解决这一挑战的方法，并详细阐述我们在实际应用中克服其固有局限性的策略。我们以电商领域对话智能体为案例进行实证研究，具体说明实施流程与优化方案。研究结果为弥合学术研究与实际应用之间的差距提供了见解，同时提出一个可扩展、可控且可靠的AI驱动智能体开发框架。

AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models

Abstract

arXiv:2505.23020v1 Announce Type: cross Abstract: The acquisition of agentic capabilities has transformed LLMs from "knowledge providers" to "action executors", a trend that, while expanding LLMs' capability boundaries, significantly increases their susceptibility to malicious use. Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked, indicating a deficiency in agentic use safety alignment during the post-training phase. To address this gap, we propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis. By instantiating these behavior chains in simulated environments with diverse tool instances, our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics. The framework further ensures model utility by proportionally synthesizing benign instructions through non-malicious interpretations of behavior chains, precisely calibrating the boundary between helpfulness and harmlessness. Evaluation results on AgentHarm demonstrate that fine-tuning three families of open-source models using our method substantially improves their safety (35.8% to 79.5% improvement) while minimally impacting or even positively enhancing their helpfulness, outperforming various prompting methods. The dataset and code have both been open-sourced.

摘要

智能体能力的获取使大语言模型从"知识提供者"转变为"行为执行者"，这一趋势在拓展模型能力边界的同时，也显著增加了其被恶意利用的风险。已有研究表明，当前基于大语言模型的智能体即使未受攻击也会执行大量恶意任务，这表明训练后阶段存在智能体使用安全对齐的缺陷。为解决这一问题，我们提出AgentAlign框架，该框架通过抽象行为链作为安全对齐数据合成的媒介。通过在模拟环境中用多样化工具实例实例化这些行为链，我们的框架既能生成高度真实且可执行的指令，又能捕捉复杂的多步动态。框架还通过行为链的非恶意解释按比例合成良性指令，精确校准有用性与无害性之间的边界，从而确保模型效用。在AgentHarm基准上的评估结果表明，使用本方法微调的三大开源模型家族安全性显著提升（35.8%至79.5%），同时对实用性影响极小甚至产生正向提升，其效果优于各类提示方法。数据集与代码均已开源。

Verify-in-the-Graph: Entity Disambiguation Enhancement for Complex Claim Verification with Interactive Graph Representation

Abstract

arXiv:2505.22993v1 Announce Type: cross Abstract: Claim verification is a long-standing and challenging task that demands not only high accuracy but also explainability of the verification process. This task has become an emerging research issue in the era of large language models (LLMs) since real-world claims are often complex, featuring intricate semantic structures or obfuscated entities. Traditional approaches typically address this by decomposing claims into sub-claims and querying a knowledge base to resolve hidden or ambiguous entities. However, the absence of effective disambiguation strategies for these entities can compromise the entire verification process. To address these challenges, we propose Verify-in-the-Graph (VeGraph), a novel framework leveraging the reasoning and comprehension abilities of LLM agents. VeGraph operates in three phases: (1) Graph Representation - an input claim is decomposed into structured triplets, forming a graph-based representation that integrates both structured and unstructured information; (2) Entity Disambiguation - VeGraph iteratively interacts with the knowledge base to resolve ambiguous entities within the graph for deeper sub-claim verification; and (3) Verification - remaining triplets are verified to complete the fact-checking process. Experiments using Meta-Llama-3-70B (instruct version) show that VeGraph achieves competitive performance compared to baselines on two benchmarks, HoVer and FEVEROUS, effectively addressing claim verification challenges. Our source code and data are available for further use.

摘要

声明验证是一项长期存在且具有挑战性的任务，不仅要求高准确性，还需要验证过程的可解释性。在大语言模型（LLM）时代，由于现实世界中的声明往往复杂多变，具有错综复杂的语义结构或模糊实体，这一任务成为新兴的研究课题。传统方法通常通过将声明分解为子声明并查询知识库来解决隐藏或模糊的实体。然而，缺乏对这些实体的有效消歧策略可能会影响整个验证过程。为解决这些挑战，我们提出了Verify-in-the-Graph（VeGraph），一个利用LLM代理推理和理解能力的新型框架。VeGraph分为三个阶段运行：（1）图表示——将输入声明分解为结构化三元组，形成基于图的表示，整合结构化和非结构化信息；（2）实体消歧——VeGraph与知识库迭代交互，解决图中的模糊实体，以进行更深入的子声明验证；（3）验证——验证剩余的三元组以完成事实核查过程。使用Meta-Llama-3-70B（指令版本）的实验表明，与HoVer和FEVEROUS两个基准测试的基线相比，VeGraph实现了具有竞争力的性能，有效解决了声明验证的挑战。我们的源代码和数据可供进一步利用。

Context Robust Knowledge Editing for Language Models

Abstract

arXiv:2505.23026v1 Announce Type: cross Abstract: Knowledge editing (KE) methods offer an efficient way to modify knowledge in large language models. Current KE evaluations typically assess editing success by considering only the edited knowledge without any preceding contexts. In real-world applications, however, preceding contexts often trigger the retrieval of the original knowledge and undermine the intended edit. To address this issue, we develop CHED -- a benchmark designed to evaluate the context robustness of KE methods. Evaluations on CHED show that existing KE methods often fail when preceding contexts are present. To mitigate this shortcoming, we introduce CoRE, a KE method designed to strengthen context robustness by minimizing context-sensitive variance in hidden states of the model for edited knowledge. This method not only improves the editing success rate in situations where a preceding context is present but also preserves the overall capabilities of the model. We provide an in-depth analysis of the differing impacts of preceding contexts when introduced as user utterances versus assistant responses, and we dissect attention-score patterns to assess how specific tokens influence editing success.

摘要

知识编辑（KE）方法为修改大语言模型中的知识提供了一种高效途径。当前KE评估通常仅考虑被编辑知识本身而忽略前置语境来判断编辑是否成功。然而在实际应用中，前置语境往往会触发原始知识的检索，从而破坏预期编辑效果。针对这一问题，我们开发了CHED基准——专门用于评估KE方法的语境鲁棒性。在CHED上的测试表明，当前方法在前置语境存在时经常失效。为改善这一缺陷，我们提出了CoRE方法，该技术通过最小化被编辑知识在模型隐藏状态中的语境敏感方差来增强语境鲁棒性。该方法不仅提高了存在前置语境时的编辑成功率，同时保持了模型的整体能力。我们深入分析了前置语境作为用户话语与助手响应时产生的差异化影响，并通过解构注意力分数模式来评估特定标记对编辑成功的影响机制。
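The quantity CoRE is said to minimize, the context-sensitive variance of hidden states for edited knowledge, can be pictured with a small sketch: take the hidden vector for the same edited prompt under several different preceding contexts and measure how far those vectors spread around their centroid. The exact form below (mean squared distance to the centroid) is an assumption chosen for illustration; the abstract does not state the paper's precise objective or which layers it uses.

```python
from typing import List, Sequence


def context_variance(hidden_states: Sequence[Sequence[float]]) -> float:
    """Mean squared distance of each context's hidden vector from the centroid.

    hidden_states[k] is the hidden vector obtained for the same edited prompt
    when it is preceded by the k-th context (hypothetical setup).
    """
    k = len(hidden_states)
    if k == 0:
        raise ValueError("need at least one hidden state")
    dim = len(hidden_states[0])
    centroid: List[float] = [
        sum(h[j] for h in hidden_states) / k for j in range(dim)
    ]
    total = 0.0
    for h in hidden_states:
        total += sum((h[j] - centroid[j]) ** 2 for j in range(dim))
    return total / k
```

A value of zero means the preceding context has no effect on the representation of the edited fact; in training, a term like this would be added to the editing loss as a regularizer.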

Augment or Not? A Comparative Study of Pure and Augmented Large Language Model Recommenders

Abstract

arXiv:2505.23053v1 Announce Type: cross Abstract: Large language models (LLMs) have introduced new paradigms for recommender systems by enabling richer semantic understanding and incorporating implicit world knowledge. In this study, we propose a systematic taxonomy that classifies existing approaches into two categories: (1) Pure LLM Recommenders, which rely solely on LLMs, and (2) Augmented LLM Recommenders, which integrate additional non-LLM techniques to enhance performance. This taxonomy provides a novel lens through which to examine the evolving landscape of LLM-based recommendation. To support fair comparison, we introduce a unified evaluation platform that benchmarks representative models under consistent experimental settings, highlighting key design choices that impact effectiveness. We conclude by discussing open challenges and outlining promising directions for future research. This work offers both a comprehensive overview and practical guidance for advancing next-generation LLM-powered recommenders.

摘要

大语言模型（LLMs）通过实现更丰富的语义理解与隐式世界知识融合，为推荐系统引入了新范式。本研究提出系统性分类法，将现有方法划分为两类：（1）纯LLM推荐系统，完全依赖LLM；（2）增强型LLM推荐系统，整合额外非LLM技术以提升性能。该分类法为审视基于LLM的推荐系统演进提供了新视角。为支持公平比较，我们构建统一评估平台，在一致实验环境下对代表性模型进行基准测试，揭示影响效能的关键设计选择。最后探讨开放挑战并展望未来研究方向。本工作既提供了全面综述，也为推进下一代LLM驱动的推荐系统发展提供了实践指导。

From Token to Action: State Machine Reasoning to Mitigate Overthinking in Information Retrieval

Abstract

arXiv:2505.23059v1 Announce Type: cross Abstract: Chain-of-Thought (CoT) prompting enables complex reasoning in large language models (LLMs), including applications in information retrieval (IR). However, it often leads to overthinking, where models produce excessively long and semantically redundant traces with little or no benefit. We identify two key challenges in IR: redundant trajectories that revisit similar states and misguided reasoning that diverges from user intent. To address these, we propose State Machine Reasoning (SMR), a transition-based reasoning framework composed of discrete actions (Refine, Rerank, Stop) that support early stopping and fine-grained control. Experiments on the BEIR and BRIGHT benchmarks show that SMR improves retrieval performance (nDCG@10) by 3.4% while reducing token usage by 74.4%. It generalizes across LLMs and retrievers without requiring task-specific tuning, offering a practical alternative to conventional CoT reasoning. The code and details are available at https://github.com/ldilab/SMR.

摘要

思维链（CoT）提示使大型语言模型（LLM）能够进行复杂推理，包括在信息检索（IR）中的应用。然而，这种方法常导致过度思考，即模型生成冗长且语义冗余的推理轨迹，却收效甚微。我们识别出IR中的两个关键挑战：重复访问相似状态的冗余轨迹，以及偏离用户意图的错误推理。为此，我们提出状态机推理（SMR），这是一种基于状态转移的推理框架，由离散动作（优化、重排、停止）组成，支持提前停止和细粒度控制。在BEIR和BRIGHT基准测试上的实验表明，SMR将检索性能（nDCG@10）提高了3.4%，同时减少了74.4%的token使用量。该框架无需针对特定任务进行调整，即可在不同LLM和检索器上泛化，为传统CoT推理提供了实用替代方案。代码及详情见https://github.com/ldilab/SMR。
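The transition-based loop described in this abstract, with its three discrete actions (Refine, Rerank, Stop) and early stopping, can be sketched as a small state machine. The state layout, the `policy` and `transitions` callables, the step cap, and the stop-on-revisit rule are hypothetical choices made here to illustrate the idea; the authors' implementation is in the repository linked in the abstract and may differ.

```python
from enum import Enum
from typing import Callable, Dict, List, Set, Tuple


class Action(Enum):
    REFINE = "refine"   # rewrite the query
    RERANK = "rerank"   # reorder the current candidate documents
    STOP = "stop"       # end reasoning and return the current ranking


# A state is (query, ranked document ids); hashable so revisits can be detected.
State = Tuple[str, Tuple[str, ...]]


def run_state_machine(
    state: State,
    policy: Callable[[State], Action],
    transitions: Dict[Action, Callable[[State], State]],
    max_steps: int = 8,
) -> Tuple[State, List[Action]]:
    """Apply actions chosen by `policy` until STOP, a revisit, or the step cap."""
    seen: Set[State] = {state}
    trace: List[Action] = []
    for _ in range(max_steps):
        action = policy(state)
        trace.append(action)
        if action is Action.STOP:
            break
        state = transitions[action](state)
        if state in seen:
            break  # a revisited state is a redundant trajectory, so stop early
        seen.add(state)
    return state, trace
```

In practice `policy` would be an LLM call that returns one of the three actions, and the Refine transition would also re-run retrieval for the rewritten query. Keeping the action set discrete is what makes the step cap and revisit check cheap to enforce.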

Abstract

arXiv:2505.23121v1 Announce Type: cross Abstract: Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, the existing open-source multi-modal models suffer from the weak capability of multi-turn interaction, especially for long contexts. To address the issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced soon. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog and experimental results illustrate that ContextQFormer achieves an improvement of 2%-4% in available rate over baselines.

摘要

多模态大语言模型已展现出卓越的零样本能力和强大的图像理解性能。然而现有开源多模态模型存在多轮交互能力薄弱的问题，尤其针对长上下文场景。为解决该问题，我们首先提出了一种称为ContextQFormer的上下文建模模块，该模块通过记忆块机制增强上下文信息的表征能力。为进一步推动相关研究，我们精心构建了新型多轮多模态对话数据集（TMDialog），涵盖预训练、指令微调和评估全流程，该数据集将于近期开源。相较于其他多模态对话数据集，TMDialog包含更长对话轮次，可有效支持多轮多模态对话研究。实验部分将ContextQFormer与三种基线模型在TMDialog上进行对比，结果表明ContextQFormer在可用率指标上较基线模型提升2%-4%。

Decom-Renorm-Merge: Model Merging on the Right Space Improves Multitasking

Abstract

arXiv:2505.23117v1 Announce Type: cross Abstract: In the era of large-scale training, model merging has evolved into a tool for creating multitasking models efficiently. It enables the knowledge of models to be fused without the heavy computation required in traditional multitask learning. Existing merging methods often assume that entries at identical positions in weight matrices serve the same function, enabling straightforward entry-wise comparison and merging. However, this assumption overlooks the complexity of finetuned neural networks, where neurons may develop distinct feature compositions, making direct entry-wise merging problematic. We present Decom-Renorm-Merge (DRM), a simple yet effective approach that leverages Singular Value Decomposition to decompose and coordinate weight matrices into an aligned joint space, where entry-wise merging becomes possible. We showcase the effectiveness of DRM across various settings, ranging from smaller encoder-based models such as ViT and DeBERTa, through encoder-decoder-based models such as T5, to larger decoder-based models such as Llama3.1-8B. Our experimental results show that DRM outperforms several state-of-the-art merging techniques across full finetuning and low-rank adaptation settings. Moreover, our analysis reveals renormalization as the crucial component for creating a robust and even joint space for merging, significantly contributing to the method's performance.

摘要

在大规模训练时代，模型融合已发展成为高效创建多任务模型的工具。该方法能够实现模型知识的融合，而无需传统多任务学习所需的大量计算。现有融合方法通常假设权重矩阵相同位置的条目具有相同功能，从而支持直接的逐项比较与融合。然而这种假设忽视了微调神经网络的复杂性——神经元可能形成不同的特征组合，导致直接逐项融合存在问题。我们提出分解-重归一化-融合（DRM）方法，通过奇异值分解将权重矩阵分解并协调至对齐的联合空间，从而实现有效的逐项融合。我们在多种模型架构上验证了DRM的有效性，包括基于较小编码器的ViT和DeBERTa、基于编码器-解码器的T5，以及基于更大解码器的Llama3.1-8B。实验结果表明，DRM在完全微调和低秩适配场景下均优于当前最先进的融合技术。进一步分析表明，重归一化是创建鲁棒且均匀融合空间的关键环节，对方法性能提升具有显著贡献。

ExpeTrans: LLMs Are Experiential Transfer Learners

Abstract

arXiv:2505.23191v1 Announce Type: cross Abstract: Recent studies provide large language models (LLMs) with textual task-solving experiences via prompts to improve their performance. However, previous methods rely on substantial human labor or time to gather such experiences for each task, which is impractical given the growing variety of task types in user queries to LLMs. To address this issue, we design an autonomous experience transfer framework to explore whether LLMs can mimic human cognitive intelligence to autonomously transfer experience from existing source tasks to newly encountered target tasks. This not only allows the acquisition of experience without extensive costs of previous methods, but also offers a novel path for the generalization of LLMs. Experimental results on 13 datasets demonstrate that our framework effectively improves the performance of LLMs. Furthermore, we provide a detailed analysis of each module in the framework.

摘要

近期研究通过提示词为大型语言模型（LLMs）提供文本任务解决经验以提升其性能。然而传统方法依赖大量人力或时间成本为每个任务收集此类经验，鉴于用户向LLMs提出的任务类型日益多样化，这种模式已不具可行性。为解决该问题，我们设计了一种自主经验迁移框架，旨在探究LLMs能否模拟人类认知智能，将既有源任务的经验自主迁移至新接触的目标任务。这种方法不仅无需传统方法的高成本即可获取经验，还为LLMs的泛化能力提供了新路径。在13个数据集上的实验结果表明，本框架能有效提升LLMs性能。此外，我们对框架中各模块进行了详细分析。

Cross-Task Experiential Learning on LLM-based Multi-Agent Collaboration

Abstract

arXiv:2505.23187v1 Announce Type: cross Abstract: Large Language Model-based multi-agent systems (MAS) have shown remarkable progress in solving complex tasks through collaborative reasoning and inter-agent critique. However, existing approaches typically treat each task in isolation, resulting in redundant computations and limited generalization across structurally similar tasks. To address this, we introduce multi-agent cross-task experiential learning (MAEL), a novel framework that endows LLM-driven agents with explicit cross-task learning and experience accumulation. We model the task-solving workflow on a graph-structured multi-agent collaboration network, where agents propagate information and coordinate via explicit connectivity. During the experiential learning phase, we quantify the quality of each step in the task-solving workflow and store the resulting rewards along with the corresponding inputs and outputs into each agent's individual experience pool. During inference, agents retrieve high-reward, task-relevant experiences as few-shot examples to enhance the effectiveness of each reasoning step, thereby enabling more accurate and efficient multi-agent collaboration. Experimental results on diverse datasets demonstrate that MAEL empowers agents to learn from prior task experiences effectively, achieving faster convergence and producing higher-quality solutions on current tasks.

摘要

基于大语言模型的多智能体系统（MAS）通过协作推理与智能体间互评机制，在解决复杂任务方面展现出显著进展。然而现有方法通常孤立处理每项任务，导致计算冗余且难以在结构相似任务间实现泛化。为此，我们提出多智能体跨任务经验学习框架（MAEL），该创新框架赋予大语言模型驱动的智能体显式的跨任务学习与经验积累能力。我们在图结构的多智能体协作网络上建模任务求解流程，智能体通过显式连接进行信息传播与协同。在经验学习阶段，我们量化任务求解流程中每个步骤的质量，并将产生的奖励与对应输入输出存入各智能体的独立经验池。推理过程中，智能体检索高奖励值的任务相关经验作为少样本示例，以提升每个推理步骤的效能，从而实现更精准高效的多智能体协作。多样化数据集上的实验结果表明，MAEL能使智能体有效利用历史任务经验——在当前任务上实现更快收敛并生成更高质量的解决方案。
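
A per-agent experience pool with reward-filtered retrieval, as described above, might look roughly like the following sketch. The class and field names, the reward threshold, and the use of token-overlap (Jaccard) similarity as the relevance measure are all assumptions for illustration; the abstract does not specify how relevance is computed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Experience:
    task_input: str
    output: str
    reward: float  # step-level quality score assigned during the learning phase


@dataclass
class ExperiencePool:
    """Hypothetical per-agent store of (input, output, reward) records."""
    items: List[Experience] = field(default_factory=list)

    def add(self, task_input: str, output: str, reward: float) -> None:
        self.items.append(Experience(task_input, output, reward))

    def retrieve(self, query: str, k: int = 2, min_reward: float = 0.5) -> List[Experience]:
        """Return up to k high-reward experiences, most relevant first."""
        query_tokens = set(query.lower().split())

        def relevance(exp: Experience) -> float:
            # Jaccard overlap is a simple stand-in for a learned retriever.
            tokens = set(exp.task_input.lower().split())
            union = query_tokens | tokens
            return len(query_tokens & tokens) / len(union) if union else 0.0

        candidates = [e for e in self.items if e.reward >= min_reward]
        candidates.sort(key=lambda e: (relevance(e), e.reward), reverse=True)
        return candidates[:k]
```

The retrieved records would then be formatted as few-shot examples in the agent's prompt; that formatting step is left out here.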

OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software

Abstract

arXiv:2505.23239v1 Announce Type: cross Abstract: Usability evaluation is critical to the impact and adoption of open source software (OSS), yet traditional methods relying on human evaluators suffer from high costs and limited scalability. To address these limitations, we introduce OSS-UAgent, an automated, configurable, and interactive agent-based usability evaluation framework specifically designed for open source software. Our framework employs intelligent agents powered by large language models (LLMs) to simulate developers performing programming tasks across various experience levels (from Junior to Expert). By dynamically constructing platform-specific knowledge bases, OSS-UAgent ensures accurate and context-aware code generation. The generated code is automatically evaluated across multiple dimensions, including compliance, correctness, and readability, providing a comprehensive measure of the software's usability. Additionally, our demonstration showcases OSS-UAgent's practical application in evaluating graph analytics platforms, highlighting its effectiveness in automating usability evaluation.

摘要

可用性评估对开源软件（OSS）的影响力和采用至关重要，然而依赖人工评估的传统方法存在成本高、可扩展性有限的问题。为应对这些局限，我们提出了OSS-UAgent——一个专为开源软件设计的自动化、可配置、基于交互式代理的可用性评估框架。该框架利用大型语言模型（LLMs）驱动的智能代理，模拟不同经验水平（从初级到专家）的开发人员执行编程任务。通过动态构建特定平台的知识库，OSS-UAgent确保生成准确且上下文感知的代码。生成的代码会在合规性、正确性和可读性等多个维度进行自动评估，从而全面衡量软件的可用性。此外，我们的演示展示了OSS-UAgent在图分析平台评估中的实际应用，突显了其在自动化可用性评估方面的有效性。

VERINA: Benchmarking Verifiable Code Generation

Abstract

arXiv:2505.23135v1 Announce Type: cross Abstract: Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation -- jointly generating code, specifications, and proofs of code-specification alignment -- offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.

摘要

大型语言模型（LLMs）正日益融入软件开发，但确保LLM生成代码的正确性仍具挑战性，通常需要昂贵的人工审查。可验证代码生成——联合生成代码、规范及代码-规范一致性证明——为解决这一局限提供了可行路径，可进一步释放LLMs在编程领域的潜力。然而，当前评估存在显著缺口：现有基准测试往往缺乏对端到端可验证代码生成的支持。本文提出Verina（可验证代码生成竞技场），这是一个高质量基准测试平台，支持对代码、规范及证明生成及其组合进行模块化综合评估。Verina包含189个经人工筛选的Lean编程任务，涵盖详细问题描述、参考实现、形式化规范及大量测试用例。我们对前沿LLMs的广泛评估揭示了可验证代码生成中的重大挑战，尤其在证明生成方面，凸显了提升基于LLM的定理证明器在验证领域性能的必要性。表现最佳的模型OpenAI o4-mini在每任务单次尝试中仅生成61.4%正确代码、51.0%健全且完整的规范及3.6%成功证明。我们期望Verina能通过提供严格全面的基准测试，推动可验证代码生成领域的进展。数据集发布于https://huggingface.co/datasets/sunblaze-ucb/verina，评估代码发布于https://github.com/sunblaze-ucb/verina。
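
To make the three artifacts concrete (code, specification, and a proof that the code meets the specification), here is a toy Lean 4 sketch. It is not taken from the Verina benchmark: the names `myMax`, `myMax_spec`, and `myMax_correct` are hypothetical, and the proof assumes a recent Lean 4 toolchain in which the `omega` tactic is available.

```lean
-- Code: a candidate implementation.
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- Specification: the result bounds both inputs and equals one of them.
def myMax_spec (a b r : Nat) : Prop := a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- Proof: the implementation satisfies the specification.
theorem myMax_correct (a b : Nat) : myMax_spec a b (myMax a b) := by
  unfold myMax_spec myMax
  by_cases h : a ≤ b <;> simp [h] <;> omega
```

Even in a toy case like this, the proof is the step most sensitive to small changes in the definitions, which is consistent with the low proof success rate reported in the abstract.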

Unsupervised Word-level Quality Estimation for Machine Translation Through the Lens of Annotators (Dis)agreement

Abstract

arXiv:2505.23183v1 Announce Type: cross Abstract: Word-level quality estimation (WQE) aims to automatically identify fine-grained error spans in machine-translated outputs and has found many uses, including assisting translators during post-editing. Modern WQE techniques are often expensive, involving prompting of large language models or ad-hoc training on large amounts of human-labeled data. In this work, we investigate efficient alternatives exploiting recent advances in language model interpretability and uncertainty quantification to identify translation errors from the inner workings of translation models. In our evaluation spanning 14 metrics across 12 translation directions, we quantify the impact of human label variation on metric performance by using multiple sets of human labels. Our results highlight the untapped potential of unsupervised metrics, the shortcomings of supervised methods when faced with label uncertainty, and the brittleness of single-annotator evaluation practices.

摘要

单词级质量评估（WQE）旨在自动识别机器翻译输出中的细粒度错误片段，其应用场景包括辅助译员进行后期编辑。现代WQE技术往往成本高昂，涉及对大语言模型的提示或基于大量人工标注数据的临时训练。本研究探索了高效替代方案，利用语言模型可解释性和不确定性量化的最新进展，通过翻译模型内部机制识别翻译错误。我们在涵盖12个翻译方向、14项指标的评估中，通过使用多组人工标注量化了人工标注差异对指标性能的影响。研究结果揭示了无监督指标的未开发潜力、监督方法在面临标注不确定性时的局限性，以及单一标注者评估实践的脆弱性。

MCTSr-Zero: Self-Reflective Psychological Counseling Dialogues Generation via Principles and Adaptive Exploration

Abstract

arXiv:2505.23229v1 Announce Type: cross Abstract: The integration of Monte Carlo Tree Search (MCTS) with Large Language Models (LLMs) has demonstrated significant success in structured, problem-oriented tasks. However, applying these methods to open-ended dialogues, such as those in psychological counseling, presents unique challenges. Unlike tasks with objective correctness, success in therapeutic conversations depends on subjective factors like empathetic engagement, ethical adherence, and alignment with human preferences, for which strict "correctness" criteria are ill-defined. Existing result-oriented MCTS approaches can therefore produce misaligned responses. To address this, we introduce MCTSr-Zero, an MCTS framework designed for open-ended, human-centric dialogues. Its core innovation is "domain alignment", which shifts the MCTS search objective from predefined end-states towards conversational trajectories that conform to target domain principles (e.g., empathy in counseling). Furthermore, MCTSr-Zero incorporates "Regeneration" and "Meta-Prompt Adaptation" mechanisms to substantially broaden exploration by allowing the MCTS to consider fundamentally different initial dialogue strategies. We evaluate MCTSr-Zero in psychological counseling by generating multi-turn dialogue data, which is used to fine-tune an LLM, PsyLLM. We also introduce PsyEval, a benchmark for assessing multi-turn psychological counseling dialogues. Experiments demonstrate that PsyLLM achieves state-of-the-art performance on PsyEval and other relevant metrics, validating MCTSr-Zero's effectiveness in generating high-quality, principle-aligned conversational data for human-centric domains and addressing the LLM challenge of consistently adhering to complex psychological standards.

摘要

将蒙特卡洛树搜索（MCTS）与大型语言模型（LLMs）相结合的方法在结构化、问题导向型任务中已取得显著成功。然而，该方法在开放式对话（如心理咨询场景）中的应用面临独特挑战。由于此类对话缺乏客观正确性标准，其成功与否取决于共情互动、伦理遵循及人类偏好对齐等主观因素，难以定义严格的"正确性"准则。现有结果导向型MCTS方法可能因此产生偏差响应。为此，我们提出MCTSr-Zero——一个面向开放式人本对话的MCTS框架，其核心创新在于"领域对齐"机制：将搜索目标从预设终态转向符合目标领域原则（如咨询中的共情要求）的对话轨迹。该框架还引入"再生"与"元提示自适应"机制，通过允许MCTS探索本质不同的初始对话策略来大幅拓宽搜索空间。我们在心理咨询场景下评估MCTSr-Zero，生成多轮对话数据用于微调PsyLLM模型，并建立PsyEval基准来评估多轮心理咨询对话质量。实验表明，PsyLLM在PsyEval及相关指标上达到最先进水平，验证了MCTSr-Zero在生成符合人本领域原则的高质量对话数据方面的有效性，同时解决了LLMs持续遵循复杂心理标准的核心难题。

Does Machine Unlearning Truly Remove Model Knowledge? A Framework for Auditing Unlearning in LLMs

Abstract

arXiv:2505.23270v1 Announce Type: cross Abstract: In recent years, Large Language Models (LLMs) have achieved remarkable advancements, drawing significant attention from the research community. Their capabilities are largely attributed to large-scale architectures, which require extensive training on massive datasets. However, such datasets often contain sensitive or copyrighted content sourced from the public internet, raising concerns about data privacy and ownership. Regulatory frameworks, such as the General Data Protection Regulation (GDPR), grant individuals the right to request the removal of such sensitive information. This has motivated the development of machine unlearning algorithms that aim to remove specific knowledge from models without the need for costly retraining. Despite these advancements, evaluating the efficacy of unlearning algorithms remains a challenge due to the inherent complexity and generative nature of LLMs. In this work, we introduce a comprehensive auditing framework for unlearning evaluation, comprising three benchmark datasets, six unlearning algorithms, and five prompt-based auditing methods. By using various auditing algorithms, we evaluate the effectiveness and robustness of different unlearning strategies. To explore alternatives beyond prompt-based auditing, we propose a novel technique that leverages intermediate activation perturbations, addressing the limitations of auditing methods that rely solely on model inputs and outputs.

摘要

近年来，大型语言模型（LLMs）取得了显著进展，引起了研究界的广泛关注。其强大能力主要归功于大规模架构，这需要对海量数据集进行广泛训练。然而，此类数据集通常包含来自公共互联网的敏感或受版权保护内容，引发了关于数据隐私和所有权的担忧。《通用数据保护条例》（GDPR）等监管框架赋予个人要求删除此类敏感信息的权利。这推动了机器遗忘算法的发展，旨在无需昂贵重新训练的情况下从模型中移除特定知识。尽管取得这些进展，由于LLMs固有的复杂性和生成特性，评估遗忘算法的有效性仍具挑战性。本研究提出一个全面的遗忘评估审计框架，包含三个基准数据集、六种遗忘算法和五种基于提示的审计方法。通过使用多种审计算法，我们评估了不同遗忘策略的有效性和鲁棒性。为探索基于提示审计之外的替代方案，我们提出一种利用中间激活扰动的新技术，解决了仅依赖模型输入输出的审计方法的局限性。

Abstract

arXiv:2505.23250v1 Announce Type: cross Abstract: We present the methodology and results of the Deep Retrieval team for subtask 4b of the CLEF CheckThat! 2025 competition, which focuses on retrieving relevant scientific literature for given social media posts. To address this task, we propose a hybrid retrieval pipeline that combines lexical precision, semantic generalization, and deep contextual re-ranking, enabling robust retrieval that bridges the informal-to-formal language gap. Specifically, we combine BM25-based keyword matching with a FAISS vector store using a fine-tuned INF-Retriever-v1 model for dense semantic retrieval. BM25 returns the top 30 candidates, and semantic search yields 100 candidates, which are then merged and re-ranked via a large language model (LLM)-based cross-encoder. Our approach achieves a mean reciprocal rank at 5 (MRR@5) of 76.46% on the development set and 66.43% on the hidden test set, securing the 1st position on the development leaderboard and ranking 3rd on the test leaderboard (out of 31 teams), with a relative performance gap of only 2 percentage points compared to the top-ranked system. We achieve this strong performance by running open-source models locally and without external training data, highlighting the effectiveness of a carefully designed and fine-tuned retrieval pipeline.

摘要

我们展示了Deep Retrieval团队在CLEF CheckThat! 2025竞赛子任务4b中的方法与成果，该任务旨在为给定社交媒体帖子检索相关科学文献。为解决这一问题，我们提出了一种混合检索流程，结合词汇精确匹配、语义泛化与深度上下文重排序，通过跨越非正式与正式语言鸿沟实现鲁棒检索。具体而言，我们将基于BM25的关键词匹配与采用精调INF-Retriever-v1模型的FAISS向量库相结合进行稠密语义检索。BM25返回前30个候选，语义搜索产生100个候选，随后通过基于大语言模型（LLM）的交叉编码器进行合并与重排序。

该方法在开发集上达到76.46%的5级平均倒数排名（MRR@5），在隐藏测试集上获得66.43%的成绩，在开发排行榜位列第一，测试排行榜位列第三（共31支队伍），与排名第一的系统仅存在2个百分点的相对性能差距。我们通过本地运行开源模型且不使用外部训练数据实现了这一强劲性能，凸显了精心设计与精调检索流程的有效性。
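
The candidate-merging step and the MRR@5 metric mentioned above can be sketched as follows. The function names and the `score` callback (standing in for the LLM-based cross-encoder) are hypothetical, and the metric assumes exactly one gold document per query; the cut-offs of 30 and 100 are the ones stated in the abstract.

```python
from typing import Callable, Dict, List, Sequence


def merge_candidates(
    lexical: Sequence[str],
    dense: Sequence[str],
    lexical_k: int = 30,
    dense_k: int = 100,
) -> List[str]:
    """Union of the top lexical and top dense hits, order-preserving, no duplicates."""
    merged, seen = [], set()
    for doc_id in list(lexical[:lexical_k]) + list(dense[:dense_k]):
        if doc_id not in seen:
            seen.add(doc_id)
            merged.append(doc_id)
    return merged


def rerank(query: str, candidates: Sequence[str], score: Callable[[str, str], float]) -> List[str]:
    """Sort candidates by a relevance score, highest first."""
    return sorted(candidates, key=lambda doc_id: score(query, doc_id), reverse=True)


def mrr_at_k(rankings: Dict[str, List[str]], gold: Dict[str, str], k: int = 5) -> float:
    """Mean reciprocal rank of the gold document, counting only the top k positions."""
    total = 0.0
    for query_id, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id == gold[query_id]:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0
```

A query whose gold document appears below position 5 contributes zero, so MRR@5 rewards placing the right paper near the very top rather than merely retrieving it.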

Abstract

arXiv:2505.23266v1 Announce Type: cross Abstract: We present Adversarial Object Fusion (AdvOF), a novel attack framework targeting vision-and-language navigation (VLN) agents in service-oriented environments by generating adversarial 3D objects. While foundation models like Large Language Models (LLMs) and Vision Language Models (VLMs) have enhanced service-oriented navigation systems through improved perception and decision-making, their integration introduces vulnerabilities in mission-critical service workflows. Existing adversarial attacks fail to address service computing contexts, where reliability and quality-of-service (QoS) are paramount. We use AdvOF to investigate the impact of adversarial environments on the VLM-based perception module of VLN agents. In particular, AdvOF first precisely aggregates and aligns the victim object positions in both 2D and 3D space, defining and rendering adversarial objects. Then, we collaboratively optimize the adversarial object with regularization between the adversarial and victim object across physical properties and VLM perceptions. By assigning importance weights to different views, the optimization proceeds stably across multiple views through iterative fusion of local updates and justifications. Our extensive evaluations demonstrate that AdvOF can effectively degrade agent performance under adversarial conditions while maintaining minimal interference with normal navigation tasks. This work advances the understanding of service security in VLM-powered navigation systems, providing computational foundations for robust service composition in physical-world deployments.

摘要

我们提出对抗性物体融合（AdvOF），一种通过生成对抗性3D物体来针对服务导向环境中视觉与语言导航（VLN）智能体的新型攻击框架。尽管大型语言模型（LLM）和视觉语言模型（VLM）等基础模型通过提升感知与决策能力增强了服务导向导航系统，其集成过程却为关键任务服务流程引入了安全漏洞。现有对抗攻击未能考虑以服务可靠性和服务质量（QoS）为核心的服务计算场景。AdvOF通过研究对抗环境对VLN智能体基于VLM的感知模块的影响，首先在2D与3D空间精确聚合并对齐受害物体位置，定义并渲染对抗物体；随后通过物理属性与VLM感知层面的对抗-受害物体间正则化约束，协同优化对抗物体。通过为不同视角分配重要性权重，该优化过程借助局部更新与修正的迭代融合，实现了稳定且多视角的优化处理。大量实验表明，AdvOF能在保持对正常导航任务最小干扰的同时，有效降低对抗条件下智能体的性能表现。本研究深化了对VLM驱动导航系统服务安全的理解，为物理世界部署中鲁棒服务组合提供了计算基础。

Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective

Abstract

arXiv:2505.23277v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. Existing compression methods typically require supervised training of dedicated compression models, increasing cost and reducing portability. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task. Rather than training a compression model, Sentinel probes decoder attention from an off-the-shelf 0.5B proxy LLM using a lightweight classifier to identify sentence relevance. Empirically, we find that query-context relevance estimation is consistent across model scales, with 0.5B proxies closely matching the behaviors of larger models. On the LongBench benchmark, Sentinel achieves up to 5 $\times$ compression while matching the QA performance of 7B-scale compression systems. Our results suggest that probing native attention signals enables fast, effective, and question-aware context compression. Code available at: https://github.com/yzhangchuck/Sentinel.

摘要

检索增强生成（RAG）通过外部上下文增强大语言模型（LLM）的性能，但检索到的段落往往冗长、包含噪声或超出输入限制。现有压缩方法通常需要监督训练专用的压缩模型，增加了成本并降低了可移植性。我们提出Sentinel，一种轻量级的句子级压缩框架，将上下文过滤重新定义为基于注意力的理解任务。Sentinel无需训练压缩模型，而是通过轻量级分类器探测现成的0.5B代理LLM的解码器注意力，以识别句子相关性。实证研究表明，查询-上下文相关性估计在不同规模模型间具有一致性，0.5B代理模型能紧密匹配更大模型的行为。在LongBench基准测试中，Sentinel实现了高达5倍的压缩率，同时匹配7B规模压缩系统的问答性能。结果表明，探测原生注意力信号能够实现快速、有效且问题感知的上下文压缩。代码发布于：https://github.com/yzhangchuck/Sentinel。
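
Once each sentence has a relevance score, sentence-level compression reduces to a selection problem. The sketch below assumes the scores are already available (in the abstract they come from a classifier over proxy-model attention, which is not reproduced here); the greedy budgeted selection, the whitespace word count used as a token proxy, and the function name `compress` are illustrative assumptions.

```python
from typing import List, Sequence


def compress(sentences: Sequence[str], scores: Sequence[float], budget_tokens: int) -> List[str]:
    """Keep the highest-scoring sentences that fit the budget, in original order."""
    if len(sentences) != len(scores):
        raise ValueError("sentences and scores must have the same length")
    order = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in order:
        cost = len(sentences[i].split())  # crude token count: whitespace-separated words
        if used + cost <= budget_tokens:
            kept.add(i)
            used += cost
    # Restoring document order keeps the compressed context readable.
    return [sentences[i] for i in sorted(kept)]
```

For example, with a budget of 10 words and scores of 0.9, 0.1, and 0.6 for three short sentences, the first and third sentences are kept and the low-scoring middle one is dropped.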

The Arabic AI Fingerprint: Stylometric Analysis and Detection of Large Language Models Text

Abstract

arXiv:2505.23276v1 Announce Type: cross Abstract: Large Language Models (LLMs) have achieved unprecedented capabilities in generating human-like text, posing subtle yet significant challenges for information integrity across critical domains, including education, social media, and academia. Such text can enable sophisticated misinformation campaigns, compromise healthcare guidance, and facilitate targeted propaganda. This challenge is particularly severe in under-explored and low-resource languages like Arabic. This paper presents a comprehensive investigation of Arabic machine-generated text, examining multiple generation strategies (generation from the title only, content-aware generation, and text refinement) across diverse model architectures (ALLaM, Jais, Llama, and GPT-4) in academic and social media domains. Our stylometric analysis reveals distinctive linguistic patterns differentiating human-written from machine-generated Arabic text across these varied contexts. Despite their human-like qualities, we demonstrate that LLMs produce detectable signatures in their Arabic outputs, with domain-specific characteristics that vary significantly between different contexts. Based on these insights, we developed BERT-based detection models that achieved exceptional performance in formal contexts (up to 99.9% F1-score) with strong precision across model architectures. Our cross-domain analysis confirms generalization challenges previously reported in the literature. To the best of our knowledge, this work represents the most comprehensive investigation of Arabic machine-generated text to date, uniquely combining multiple prompt generation methods, diverse model architectures, and in-depth stylometric analysis across varied textual domains, establishing a foundation for developing robust, linguistically-informed detection systems essential for preserving information integrity in Arabic-language contexts.

摘要

大型语言模型（LLM）在生成类人文本方面展现出前所未有的能力，这对教育、社交媒体和学术等关键领域的信息完整性构成了微妙而重大的挑战，包括助长复杂虚假信息传播、损害医疗指导准确性以及便利定向宣传。这一挑战在阿拉伯语等资源匮乏且研究不足的语言中尤为严峻。本文对阿拉伯语机器生成文本展开全面研究，通过学术与社交媒体领域的多架构模型（ALLaM、Jais、Llama和GPT-4）考察了三种生成策略（仅标题生成、内容感知生成及文本润色）。我们的文体计量分析揭示了不同语境下区分人工撰写与机器生成阿拉伯文本的独特语言模式。研究表明，尽管LLM产出具有高度拟人性，但其阿拉伯语输出仍存在可检测的特征标记，且不同领域间呈现显著差异的特定模式。基于这些发现，我们开发的BERT检测模型在正式语境中表现出卓越性能（F1值最高达99.9%），且在不同模型架构间均保持高精度。跨领域分析证实了文献中已报道的泛化挑战。据我们所知，本研究首次整合多提示生成方法、多样化模型架构及跨文本领域的深度文体分析，构成了迄今最全面的阿拉伯语机器生成文本研究，为开发基于语言学特征的鲁棒检测系统奠定了基础，这对维护阿拉伯语信息生态的完整性至关重要。

Accelerating RLHF Training with Reward Variance Increase

Abstract

arXiv:2505.23247v1 Announce Type: cross Abstract: Reinforcement learning from human feedback (RLHF) is an essential technique for ensuring that large language models (LLMs) are aligned with human values and preferences during the post-training phase. As an effective RLHF approach, group relative policy optimization (GRPO) has demonstrated success in many LLM-based applications. However, efficient GRPO-based RLHF training remains a challenge. Recent studies reveal that a higher reward variance of the initial policy model leads to faster RLHF training. Inspired by this finding, we propose a practical reward adjustment model to accelerate RLHF training by provably increasing the reward variance and preserving the relative preferences and reward expectation. Our reward adjustment method inherently poses a nonconvex optimization problem, which is NP-hard to solve in general. To overcome the computational challenges, we design a novel $O(n \log n)$ algorithm to find a global solution of the nonconvex reward adjustment model by explicitly characterizing the extreme points of the feasible set. As an important application, we naturally integrate this reward adjustment model into the GRPO algorithm, leading to a more efficient GRPO with reward variance increase (GRPOVI) algorithm for RLHF training. As an interesting byproduct, we provide an indirect explanation for the empirical effectiveness of GRPO with rule-based reward for RLHF training, as demonstrated in DeepSeek-R1. Experiment results demonstrate that the GRPOVI algorithm can significantly improve the RLHF training efficiency compared to the original GRPO algorithm.

摘要

基于人类反馈的强化学习（RLHF）是确保大型语言模型（LLM）在训练后阶段与人类价值观和偏好对齐的关键技术。作为一种有效的RLHF方法，群体相对策略优化（GRPO）已在许多基于LLM的应用中取得成功。然而，基于GRPO的高效RLHF训练仍面临挑战。近期研究表明，初始策略模型的较高奖励方差能加速RLHF训练。受此启发，我们提出一种实用的奖励调整模型，通过可证明地增加奖励方差并保持相对偏好和奖励期望来加速RLHF训练。该奖励调整方法本质上构成一个非凸优化问题，通常属于NP难问题。为克服计算挑战，我们设计了一种新颖的O(n log n)复杂度算法，通过显式表征可行集的极值点来求解非凸奖励调整模型的全局最优解。作为重要应用，我们将该模型自然整合至GRPO算法中，提出具有奖励方差提升特性的高效GRPOVI算法。作为有趣副产品，我们间接解释了DeepSeek-R1中展示的基于规则奖励的GRPO在RLHF训练中的实证有效性。实验结果表明，相较于原始GRPO算法，GRPOVI算法能显著提升RLHF训练效率。
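
Why reward variance matters for GRPO can be seen from the group-relative advantage itself. The sketch below shows the normalization as it is commonly described for GRPO (reward minus group mean, divided by group standard deviation); it is not the paper's reward adjustment algorithm, and the use of the population standard deviation and the `eps` stabilizer are assumptions.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against the mean and spread of its own group."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # population standard deviation of the group
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group receives the same reward, the variance is zero and all advantages are zero, so the group contributes no learning signal. That is the intuition behind adjusting rewards to raise variance while keeping their ordering and mean.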

Matryoshka Model Learning for Improved Elastic Student Models

Abstract

arXiv:2505.23337v1 Announce Type: cross Abstract: Industry-grade ML models are carefully designed to meet rapidly evolving serving constraints, which requires significant resources for model development. In this paper, we propose MatTA, a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe. TA models are larger versions of the Student models with higher capacity, and thus allow Student models to better relate to the Teacher model and also bring in more domain-specific expertise. Furthermore, multiple accurate Student models can be extracted from the TA model. Therefore, despite only one training run, our methodology provides multiple servable options to trade off accuracy for lower serving cost. We demonstrate the proposed method, MatTA, on proprietary datasets and models. Its practical efficacy is underscored by live A/B tests within a production ML system, demonstrating 20% improvement on a key metric. We also demonstrate our method on GPT-2 Medium, a public model, and achieve relative improvements of over 24% on SAT Math and over 10% on the LAMBADA benchmark.

摘要

工业级机器学习模型需精心设计以满足快速演变的部署约束，这往往需要投入大量资源进行模型开发。本文提出MatTA框架，采用创新的"教师-TA-学生"训练方案来培育多个高精度学生模型。TA模型作为学生模型的大容量扩展版本，既能增强学生模型与教师模型的关联性，又能注入更多领域专业知识。此外，可从单个TA模型中提取多个高精度学生模型，使得仅需单次训练即可获得多个可部署选项，实现精度与服务成本的灵活权衡。我们在私有数据集和模型上验证了MatTA方法的有效性，生产环境中的A/B测试表明关键指标提升达20%。在公开模型GPT-2 Medium上的实验也显示，该方法使SAT数学成绩相对提升超24%，LAMBADA基准测试成绩提升超10%。

Towards Reward Fairness in RLHF: From a Resource Allocation Perspective

Abstract

arXiv:2505.23349v1 Announce Type: cross Abstract: Rewards serve as proxies for human preferences and play a crucial role in Reinforcement Learning from Human Feedback (RLHF). However, if these rewards are inherently imperfect, exhibiting various biases, they can adversely affect the alignment of large language models (LLMs). In this paper, we collectively define the various biases present in rewards as the problem of reward unfairness. We propose a bias-agnostic method to address the issue of reward fairness from a resource allocation perspective, without specifically designing for each type of bias, yet effectively mitigating them. Specifically, we model preference learning as a resource allocation problem, treating rewards as resources to be allocated while considering the trade-off between utility and fairness in their distribution. We propose two methods, Fairness Regularization and Fairness Coefficient, to achieve fairness in rewards. We apply our methods in both verification and reinforcement learning scenarios to obtain a fairness reward model and a policy model, respectively. Experiments conducted in these scenarios demonstrate that our approach aligns LLMs with human preferences in a more fair manner.

摘要

奖励作为人类偏好的代理指标，在基于人类反馈的强化学习（RLHF）中起着至关重要的作用。然而，若这些奖励本身存在缺陷并呈现多种偏差，则可能对大语言模型（LLMs）的对齐产生负面影响。本文将这些奖励中存在的各类偏差统一定义为奖励不公平问题。我们提出一种与偏差类型无关的方法，从资源分配的角度解决奖励公平性问题，无需针对每种偏差专门设计，却能有效缓解其影响。具体而言，我们将偏好学习建模为资源分配问题，将奖励视为待分配的资源，同时权衡其分配过程中的效用与公平性。为此提出两种方法——公平性正则化与公平性系数，以实现奖励的公平分配。我们在验证和强化学习两种场景中分别应用该方法，获得了公平性奖励模型和策略模型。实验结果表明，我们的方法能够以更公平的方式使LLMs与人类偏好保持一致。

Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization

Abstract

arXiv:2505.23387v1 Announce Type: cross Abstract: Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both pass@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and critically reveals the power of RL in teaching LLMs to truly self-improve code efficiency.

摘要

大语言模型（LLMs）能够生成功能正确的解决方案，但在代码效率方面往往表现不足，这成为实际部署的关键瓶颈。本文提出了一种新颖的测试时迭代优化框架来解决这一问题，采用闭环系统使LLMs能够基于执行沙箱中的实证性能反馈迭代优化代码。我们探索了三种训练策略：监督微调（SFT）、直接偏好优化（DPO）和组相对策略优化（GRPO）。在Venus数据集和APPS基准上的实验表明，SFT和DPO在效率提升上迅速饱和。相比之下，GRPO通过结合强化学习（RL）与执行反馈，持续优化代码性能，显著提高了pass@1（从47%提升至62%）以及在效率上超越人类提交的概率（从31%提升至45%）。本研究不仅展示了测试时代码效率提升的有效方法，更关键地揭示了强化学习在教导LLMs真正实现代码效率自我优化方面的强大潜力。

How Does Response Length Affect Long-Form Factuality

Abstract

arXiv:2505.23295v1 Announce Type: cross Abstract: Large language models (LLMs) are widely used for long-form text generation. However, factual errors in the responses would undermine their reliability. Despite growing attention to LLM factuality, the effect of response length on factuality remains underexplored. In this work, we systematically investigate this relationship by first introducing an automatic and bi-level long-form factuality evaluation framework, which achieves high agreement with human annotations while being cost-effective. Using this framework, we conduct controlled experiments and find that longer responses exhibit lower factual precision, confirming the presence of length bias. To explain this phenomenon, we empirically examine three hypotheses: error propagation, long context, and facts exhaustion. Our results reveal that facts exhaustion, where the model gradually exhausts more reliable knowledge, is the primary cause of factual degradation, rather than the other two hypotheses.

摘要

大语言模型（LLMs）被广泛用于生成长文本。然而，回答中的事实性错误会削弱其可靠性。尽管对LLM事实性的关注日益增加，但回答长度对事实性的影响仍未得到充分探索。在本研究中，我们通过引入一个自动化、双层次的长文本事实性评估框架，系统地研究了这一关系。该框架在实现与人工标注高度一致的同时，兼具成本效益。利用这一框架，我们进行了对照实验，发现更长的回答表现出更低的事实精确度，证实了长度偏差的存在。为解释这一现象，我们实证检验了三种假设：错误传播、长上下文和事实耗尽。我们的结果表明，事实耗尽（即模型逐渐耗尽更可靠的知识）是事实性下降的主要原因，而非其他两种假设。

Can Large Language Models Challenge CNNS in Medical Image Analysis?

Abstract

arXiv:2505.23503v1 Announce Type: cross Abstract: This study presents a multimodal AI framework designed for precisely classifying medical diagnostic images. Utilizing publicly available datasets, the proposed system compares the strengths of convolutional neural networks (CNNs) and different large language models (LLMs). This in-depth comparative analysis highlights key differences in diagnostic performance, execution efficiency, and environmental impacts. Model evaluation was based on accuracy, F1-score, average execution time, average energy consumption, and estimated $CO_2$ emission. The findings indicate that although CNN-based models can outperform various multimodal techniques that incorporate both images and contextual information, applying additional filtering on top of LLMs can lead to substantial performance gains. These findings highlight the transformative potential of multimodal AI systems to enhance the reliability, efficiency, and scalability of medical diagnostics in clinical settings.

摘要

本研究提出一种用于精确分类医学诊断图像的多模态人工智能框架。该系统利用公开数据集，对比分析了卷积神经网络（CNN）与不同大语言模型（LLM）的性能优势。通过深入比较研究，重点揭示了这些模型在诊断性能、执行效率和环境影响方面的关键差异。模型评估指标包括准确率、F1分数、平均执行时间、平均能耗以及估算的二氧化碳排放量。研究结果表明：尽管基于CNN的模型可能优于同时结合图像与上下文信息的多模态技术，但在LLM基础上施加额外过滤机制可带来显著的性能提升。这些发现凸显了多模态AI系统在提升临床医学诊断可靠性、效率与可扩展性方面所具有的变革性潜力。

Cognitive Guardrails for Open-World Decision Making in Autonomous Drone Swarms

Abstract

arXiv:2505.23576v1 Announce Type: cross Abstract: Small Uncrewed Aerial Systems (sUAS) are increasingly deployed as autonomous swarms in search-and-rescue and other disaster-response scenarios. In these settings, they use computer vision (CV) to detect objects of interest and autonomously adapt their missions. However, traditional CV systems often struggle to recognize unfamiliar objects in open-world environments or to infer their relevance for mission planning. To address this, we incorporate large language models (LLMs) to reason about detected objects and their implications. While LLMs can offer valuable insights, they are also prone to hallucinations and may produce incorrect, misleading, or unsafe recommendations. To ensure safe and sensible decision-making under uncertainty, high-level decisions must be governed by cognitive guardrails. This article presents the design, simulation, and real-world integration of these guardrails for sUAS swarms in search-and-rescue missions.

摘要

小型无人航空系统（sUAS）在搜救及其他灾害响应场景中正越来越多地以自主集群形式部署。在此类场景中，它们通过计算机视觉（CV）技术检测目标对象并自主调整任务规划。然而，传统CV系统在开放世界环境中往往难以识别陌生对象或推断其对任务规划的相关性。为解决这一问题，我们引入大型语言模型（LLMs）对检测对象及其任务影响进行推理。虽然LLMs能提供有价值的洞见，但其存在幻觉倾向，可能产生错误、误导或不安全的建议。为确保不确定性条件下的决策安全可靠，高层决策必须受认知护栏约束。本文阐述了搜救任务中sUAS集群认知护栏的设计、仿真及实际系统集成方案。

Active Layer-Contrastive Decoding Reduces Hallucination in Large Language Model Generation

Abstract

arXiv:2505.23657v1 Announce Type: cross Abstract: Recent decoding methods improve the factuality of large language models (LLMs) by refining how the next token is selected during generation. These methods typically operate at the token level, leveraging internal representations to suppress superficial patterns. Nevertheless, LLMs remain prone to hallucinations, especially over longer contexts. In this paper, we propose Active Layer-Contrastive Decoding (ActLCD), a novel decoding strategy that actively decides when to apply contrasting layers during generation. By casting decoding as a sequential decision-making problem, ActLCD employs a reinforcement learning policy guided by a reward-aware classifier to optimize factuality beyond the token level. Our experiments demonstrate that ActLCD surpasses state-of-the-art methods across five benchmarks, showcasing its effectiveness in mitigating hallucinations in diverse generation scenarios.

摘要

现有解码方法通过改进生成过程中下一词元的选取策略，提升了大型语言模型（LLMs）的事实性。这些方法通常在词元层级运作，利用内部表征来抑制表层模式。然而，LLMs仍易产生幻觉，尤其在长上下文场景中。本文提出主动层对比解码（ActLCD），这是一种通过动态决策何时在生成过程中应用对比层的新颖解码策略。通过将解码建模为序列决策问题，ActLCD采用由奖励感知分类器引导的强化学习策略，实现超越词元层级的事实性优化。实验表明，ActLCD在五项基准测试中均优于现有最优方法，验证了其在多样化生成场景中缓解幻觉现象的有效性。

Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Abstract

arXiv:2505.23564v1 Announce Type: cross Abstract: Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities. Token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. At the other extreme, trajectory-level methods (e.g., GRPO) rely solely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity. SPO achieves a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate Monte Carlo (MC) advantage estimation without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving 6-12 percentage point improvements in accuracy over PPO and GRPO on GSM8K; and (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving 7-11 percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.

摘要

有效利用强化学习（RL）增强大语言模型的推理能力仍是一个关键挑战。现有方法主要采用两种对比鲜明的优势估计粒度：令牌级方法（如PPO）旨在提供细粒度优势信号，但因难以训练精确的评判模型而导致估计不准；轨迹级方法（如GRPO）仅依赖最终奖励的粗粒度优势信号，导致信用分配不精确。为解决这些局限，我们提出分段策略优化（SPO），这是一种新颖的RL框架，利用中等粒度的分段级优势估计，通过比轨迹级方法更精确的信用分配、且比令牌级方法更少的估计点，实现了基于蒙特卡洛（MC）的无评判模型精确优势估计。SPO包含三个创新策略组件：(1) 灵活分段划分；(2) 精确分段优势估计；(3) 基于分段优势的策略优化（含新型概率掩码策略）。我们进一步针对两种场景实现SPO：(1) SPO-chain用于短思维链（CoT），采用创新的基于切点的划分和链式优势估计，在GSM8K上比PPO和GRPO准确率提升6-12个百分点；(2) SPO-tree用于长CoT，采用创新的树状优势估计，显著降低MC估计成本，在2K和4K上下文评估的MATH500上比GRPO提升7-11个百分点。代码已开源：https://github.com/AIFrameResearch/SPO。
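
A minimal sketch of the segment-level advantage idea in the SPO abstract above, under stated assumptions: the value at each segment boundary is estimated by averaging the final rewards of Monte Carlo rollouts started there, and a segment's advantage is taken as the difference between the value estimates at its end and its start. The function names and the toy rewards are hypothetical; this is not the released SPO code, and the actual estimator may differ.

```python
from statistics import mean


def mc_value(rollout_rewards):
    """Monte Carlo value estimate: average final reward of rollouts from one boundary."""
    return mean(rollout_rewards)


def segment_advantages(boundary_values):
    """Advantage of segment k = V(end of segment k) - V(start of segment k).

    Assumption for this sketch: one value estimate per segment boundary,
    including the start and the end of the response.
    """
    return [end - start for start, end in zip(boundary_values, boundary_values[1:])]


# Toy data (hypothetical): 0/1 final rewards of four rollouts at each of three boundaries.
rollouts_per_boundary = [[0, 1, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]]
values = [mc_value(r) for r in rollouts_per_boundary]
advantages = segment_advantages(values)
```

With three boundaries there are two segments, so the sketch needs only three value estimates rather than one per token, which is the trade-off the abstract describes.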

AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora

Abstract

arXiv:2505.23628v1 Announce Type: cross Abstract: We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 95% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.

摘要

我们提出AutoSchemaKG框架，这是一种完全自主的知识图谱构建方法，无需预定义模式。该系统利用大语言模型直接从文本中同步抽取知识三元组并推导出完整模式，同时对实体和事件进行建模，并运用概念化方法将实例组织到语义类别中。通过处理超过5000万份文档，我们构建了ATLAS（自动化三元组链接与模式归纳）系列知识图谱，包含超过9亿个节点和59亿条边。该方法在多跳问答任务中优于现有最先进基线，并提升了大型语言模型的事实准确性。值得注意的是，我们的模式归纳实现了与人工构建模式95%的语义对齐，且完全无需人工干预，这表明具有动态归纳模式的十亿规模知识图谱能有效补充大型语言模型中的参数化知识。

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Abstract

arXiv:2505.23604v1 Announce Type: cross Abstract: Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially when the model has fewer than 100B parameters. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead from repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using only a few samples. Code, data, and models will be fully open-sourced.

摘要

语言模型（LMs）在标准化编程基准测试中表现良好，但在解决SWE-Bench中GitHub问题等实际软件工程任务时表现欠佳，尤其是当模型参数小于100B时。尽管较小模型因计算成本较低而更受实践青睐，但其性能提升仍具挑战性。现有方法主要依赖高质量数据的监督微调（SFT），但大规模数据整理成本高昂。另一种方案是测试时扩展：生成多个输出，通过验证器评分并选择最优结果。虽然有效，但该策略通常需要过量采样和高成本评分，限制了实际应用。我们提出进化测试时扩展（EvoScale），这是一种将生成视为进化过程的样本高效方法。通过选择与变异迭代优化输出，EvoScale将输出分布向高分区移动，从而减少找到正确解所需的样本量。为降低重复采样与选择的开销，我们采用强化学习（RL）训练模型实现自我进化。模型在推理时无需依赖外部验证器，而是学会通过迭代自我提升生成结果的评分。在SWE-Bench-Verified上的评估表明，EvoScale使我们的32B模型Satori-SWE-32B仅需少量样本即可匹配或超越超100B参数模型的性能。代码、数据与模型将全面开源。
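
The selection-and-mutation loop described in the EvoScale abstract above can be illustrated with a generic evolutionary search. This is only a sketch under assumptions: `score` and `mutate` are hypothetical stand-ins (in the paper the model itself refines candidates and, after RL training, does not need an external verifier at inference time), and the integer toy problem is purely illustrative.

```python
import random


def evolve(initial, score, mutate, generations=5, keep=2, children_per_parent=3, seed=0):
    """Iteratively refine candidates: select the best, mutate them, repeat."""
    rng = random.Random(seed)
    population = list(initial)
    for _ in range(generations):
        # Selection: keep the highest-scoring candidates as parents.
        parents = sorted(population, key=score, reverse=True)[:keep]
        # Mutation: each parent produces several modified variants.
        children = [mutate(p, rng) for p in parents for _ in range(children_per_parent)]
        # Parents survive, so the best score never decreases between generations.
        population = parents + children
    return max(population, key=score)


def toy_score(x):
    """Hypothetical scorer: closeness to a target value of 42."""
    return -abs(x - 42)


def toy_mutate(x, rng):
    """Hypothetical mutation: a small random step."""
    return x + rng.choice([-3, -1, 1, 3])


start = [0, 10, 20]
best = evolve(start, toy_score, toy_mutate)
```

Because the parents are carried over into each new population, the best candidate returned is never worse than the best initial candidate.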

Engineering Serendipity through Recommendations of Items with Atypical Aspects

Abstract

arXiv:2505.23580v1 Announce Type: cross Abstract: A restaurant dinner or a hotel stay may lead to memorable experiences when guests encounter unexpected aspects that also match their interests. For example, an origami-making station in the waiting area of a restaurant may be both surprising and enjoyable for a customer who is passionate about paper crafts. Similarly, an exhibit of 18th century harpsichords would be atypical for a hotel lobby and likely pique the interest of a guest who has a passion for Baroque music. Motivated by this insight, in this paper we introduce the new task of engineering serendipity through recommendations of items with atypical aspects. We describe an LLM-based system pipeline that extracts atypical aspects from item reviews, then estimates and aggregates their user-specific utility in a measure of serendipity potential that is used to rerank a list of items recommended to the user. To facilitate system development and evaluation, we introduce a dataset of Yelp reviews that are manually annotated with atypical aspects and a dataset of artificially generated user profiles, together with crowdsourced annotations of user-aspect utility values. Furthermore, we introduce a custom procedure for dynamic selection of in-context learning examples, which is shown to improve LLM-based judgments of atypicality and utility. Experimental evaluations show that serendipity-based rankings generated by the system are highly correlated with ground truth rankings for which serendipity scores are computed from manual annotations of atypical aspects and their user-dependent utility. Overall, we hope that the new recommendation task and the associated system presented in this paper catalyze further research into recommendation approaches that go beyond accuracy in their pursuit of enhanced user satisfaction. The datasets and the code are made publicly available at https://github.com/ramituncc49er/ATARS.

摘要

当顾客遇到既出人意料又符合其兴趣的非常规元素时，餐厅用餐或酒店住宿可能成为难忘体验。例如，餐厅等候区设置的折纸工作台，对于痴迷纸艺的顾客既惊喜又愉悦；同样，酒店大堂陈列18世纪大键琴的展品虽非典型，却可能激起巴洛克音乐爱好者的浓厚兴趣。基于这一洞察，本文提出通过推荐具有非常规属性的物品来实现"意外惊喜"的新任务。我们描述了一个基于大语言模型（LLM）的系统流程：首先从商品评论中提取非常规属性，随后估算并聚合这些属性的用户特定效用，形成用于商品列表重排序的"意外惊喜潜力"度量指标。为支持系统开发与评估，我们构建了两个数据集：包含人工标注非常规属性的Yelp评论数据集，以及人工生成的用户画像数据集（附带众包标注的用户-属性效用值）。此外，我们提出了一种动态选择上下文学习示例的自定义方法，该方法可提升LLM对非常规性和效用判断的准确性。实验评估表明，系统生成的基于意外惊喜的排序结果与真实排序（通过人工标注的非常规属性及其用户相关效用计算得出）具有高度相关性。总体而言，我们希望本文提出的新推荐任务及相关系统能推动超越准确率指标、致力于提升用户满意度的推荐方法研究。
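
The reranking step in the abstract above (aggregate the user-specific utility of each item's atypical aspects, then reorder the recommendation list) might look roughly like the sketch below. The abstract does not state the aggregation function, so summation is an assumption here, and the item names and utility values are made up for illustration.

```python
def serendipity_potential(aspect_utilities):
    """Aggregate per-aspect utilities for one user-item pair (summation is an assumption)."""
    return sum(aspect_utilities)


def rerank(recommended, utilities_by_item):
    """Reorder a recommendation list by descending serendipity potential.

    sorted() is stable, so items with equal potential keep their original order.
    Items without any atypical aspect get a potential of 0.
    """
    return sorted(
        recommended,
        key=lambda item: serendipity_potential(utilities_by_item.get(item, [])),
        reverse=True,
    )


# Hypothetical input: a base ranking plus utilities of each item's atypical aspects.
base_ranking = ["bistro", "diner", "cafe"]
utilities = {"bistro": [0.1], "cafe": [0.9, 0.4], "diner": []}
reranked = rerank(base_ranking, utilities)
```

In the described pipeline the aspects and their utilities would come from LLM judgments over item reviews and user profiles; here they are supplied directly.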

GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Abstract

arXiv:2505.23671v1 Announce Type: cross Abstract: Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models' capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests to analyze repository commit histories, identifying 102 challenging optimization tasks across 10 codebases that span diverse domains and programming languages. An agent is provided with a codebase and a performance test as a precise specification, and is tasked with improving the runtime efficiency, which is measured against the expert developer's optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving a success rate below 5%, with limited improvements even with inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, reliance on lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.

摘要

开发高性能软件是一项需要专业知识的复杂任务。我们提出了GSO基准测试，用于评估语言模型在开发高性能软件方面的能力。我们开发了一个自动化流程，通过生成和执行性能测试来分析代码库提交历史，从中识别出10个代码库中102个具有挑战性的优化任务，涵盖多个领域和编程语言。该测试为智能体提供代码库和性能测试作为精确规范，要求其提升运行时效率，并以专业开发者的优化效果作为衡量标准。定量评估显示，领先的SWE智能体表现欠佳，成功率不足5%，即使通过推理时间扩展也仅能实现有限改进。定性分析揭示了关键失败模式，包括处理低级语言的困难、采用惰性优化策略以及准确定位性能瓶颈的挑战。我们公开了基准测试的代码、工具及智能体运行轨迹，以促进未来研究。

Jigsaw-R1: A Study of Rule-based Visual Reinforcement Learning with Jigsaw Puzzles

Abstract

arXiv:2505.23590v1 Announce Type: cross Abstract: The application of rule-based reinforcement learning (RL) to multimodal large language models (MLLMs) introduces unique challenges and potential deviations from findings in text-only domains, particularly for perception-heavy tasks. This paper provides a comprehensive study of rule-based visual RL using jigsaw puzzles as a structured experimental framework, revealing several key findings. Firstly, we find that MLLMs, initially performing near random guessing on simple puzzles, achieve near-perfect accuracy and generalize to complex, unseen configurations through fine-tuning. Secondly, training on jigsaw puzzles can induce generalization to other visual tasks, with effectiveness tied to specific task configurations. Thirdly, MLLMs can learn and generalize with or without explicit reasoning, though open-source models often favor direct answering. Consequently, even when trained for step-by-step reasoning, they can ignore the thinking process in deriving the final answer. Fourthly, we observe that complex reasoning patterns appear to be pre-existing rather than emergent, with their frequency increasing alongside training and task difficulty. Finally, our results demonstrate that RL exhibits more effective generalization than Supervised Fine-Tuning (SFT), and an initial SFT cold start phase can hinder subsequent RL optimization. Although these observations are based on jigsaw puzzles and may vary across other visual tasks, this research contributes a valuable piece of jigsaw to the larger puzzle of collectively understanding rule-based visual RL and its potential in multimodal learning. The code is available at: https://github.com/zifuwanggg/Jigsaw-R1.

摘要

基于规则的多模态大语言模型强化学习应用在感知密集型任务中面临独特挑战，其表现可能与纯文本领域的研究结论存在差异。本研究以拼图游戏为结构化实验框架，对基于规则的视觉强化学习进行了系统探究，获得以下关键发现：首先，多模态大语言模型在简单拼图任务中初始表现接近随机猜测，但通过微调可实现近乎完美的准确率，并能泛化至复杂的未见配置；其次，拼图训练能诱导模型泛化至其他视觉任务，其效果与特定任务配置相关；第三，无论是否存在显式推理过程，模型均能实现学习与泛化，但开源模型往往倾向于直接作答——即便经过分步推理训练，仍可能忽略思维过程直接输出最终答案；第四，复杂推理模式更似预先存在而非训练涌现，其出现频率随训练进程和任务难度同步提升；最后，实验表明强化学习比监督微调具有更优的泛化能力，而初始监督微调冷启动阶段会阻碍后续强化学习优化。虽然这些发现基于拼图任务且可能随视觉任务类型变化，但本研究为理解基于规则的视觉强化学习及其在多模态学习中的潜力贡献了关键拼图。代码已开源：https://github.com/zifuwanggg/Jigsaw-R1。

VF-Eval: Evaluating Multimodal LLMs for Generating Feedback on AIGC Videos

Abstract

arXiv:2505.23693v1 Announce Type: cross Abstract: MLLMs have been widely studied for video question answering recently. However, most existing assessments focus on natural videos, overlooking synthetic videos, such as AI-generated content (AIGC). Meanwhile, some works in video generation rely on MLLMs to evaluate the quality of generated videos, but the capabilities of MLLMs in interpreting AIGC videos remain largely underexplored. To address this, we propose a new benchmark, VF-Eval, which introduces four tasks (coherence validation, error awareness, error type detection, and reasoning evaluation) to comprehensively evaluate the abilities of MLLMs on AIGC videos. We evaluate 13 frontier MLLMs on VF-Eval and find that even the best-performing model, GPT-4.1, struggles to achieve consistently good performance across all tasks. This highlights the challenging nature of our benchmark. Additionally, to investigate the practical applications of VF-Eval in improving video generation, we conduct an experiment, RePrompt, demonstrating that aligning MLLMs more closely with human feedback can benefit video generation.

摘要

近年来，多模态大语言模型（MLLMs）在视频问答领域得到广泛研究。然而，现有评估大多集中于自然视频，忽视了合成视频（如AI生成内容/AIGC）。与此同时，部分视频生成研究依赖MLLMs评估生成视频质量，但MLLMs对AIGC视频的解析能力仍缺乏深入探索。为此，我们提出新基准VF-Eval，通过四项任务——连贯性验证、错误识别、错误类型检测和推理评估——全面评估MLLMs在AIGC视频上的能力。我们对13个前沿MLLMs进行测试，发现即使表现最佳的GPT-4.1模型也难以在所有任务中保持稳定优异表现，这凸显了本基准的挑战性。此外，为探究VF-Eval在改进视频生成中的实际应用，我们设计RePrompt实验，证明使MLLMs更贴合人类反馈有助于提升视频生成质量。

SC-LoRA: Balancing Efficient Fine-tuning and Knowledge Preservation via Subspace-Constrained LoRA

Abstract

arXiv:2505.23724v1 Announce Type: cross Abstract: Parameter-Efficient Fine-Tuning (PEFT) methods, particularly Low-Rank Adaptation (LoRA), are indispensable for efficiently customizing Large Language Models (LLMs). However, vanilla LoRA suffers from slow convergence and knowledge forgetting. Recent studies have used carefully designed LoRA initialization either to enhance fine-tuning efficiency or to preserve knowledge in the pre-trained LLM. However, none of these works addresses both goals at the same time. To this end, we introduce Subspace-Constrained LoRA (SC-LoRA), a novel LoRA initialization framework engineered to navigate the trade-off between efficient fine-tuning and knowledge preservation. We achieve this by constraining the output of trainable LoRA adapters in a low-rank subspace, where the context information of fine-tuning data is most preserved while the context information of preserved knowledge is least retained, in a balanced way. Such a constraint enables the trainable weights to primarily focus on the main features of fine-tuning data while avoiding damaging the preserved knowledge features. We provide a theoretical analysis of our method and conduct extensive experiments, including safety preservation and world knowledge preservation, on various downstream tasks. In our experiments, SC-LoRA succeeds in delivering superior fine-tuning performance while markedly diminishing knowledge forgetting, surpassing contemporary LoRA initialization methods.

摘要

参数高效微调（PEFT）方法，尤其是低秩自适应（LoRA），对于高效定制大语言模型（LLM）至关重要。然而，原始LoRA存在收敛速度慢和知识遗忘问题。近期研究通过设计LoRA初始化策略，提升了微调效率或保留了预训练LLM中的知识，但均无法同时解决这两个问题。为此，我们提出子空间约束LoRA（SC-LoRA），这是一种新颖的LoRA初始化框架，旨在平衡高效微调与知识保留之间的权衡。该方法通过将可训练LoRA适配器的输出约束在低秩子空间内实现平衡——该子空间最大程度保留微调数据的上下文信息，同时最小程度保留待保护知识的上下文信息。这种约束使可训练权重主要聚焦于微调数据的核心特征，同时避免破坏已保存的知识特征。我们提供了理论分析，并在多种下游任务（包括安全保护和世界知识保留）上进行了大量实验。结果表明，SC-LoRA在显著减少知识遗忘的同时，能提供更优的微调性能，超越了现有LoRA初始化方法。

ML-Agent: Reinforcing LLM Agents for Autonomous Machine Learning Engineering

Abstract

arXiv:2505.23723v1 Announce Type: cross Abstract: The emergence of large language model (LLM)-based agents has significantly advanced the development of autonomous machine learning (ML) engineering. However, most existing approaches rely heavily on manual prompt engineering, failing to adapt and optimize based on diverse experimental experiences. Focusing on this, for the first time, we explore the paradigm of learning-based agentic ML, where an LLM agent learns through interactive experimentation on ML tasks using online reinforcement learning (RL). To realize this, we propose a novel agentic ML training framework with three key components: (1) exploration-enriched fine-tuning, which enables LLM agents to generate diverse actions for enhanced RL exploration; (2) step-wise RL, which enables training on a single action step, accelerating experience collection and improving training efficiency; (3) an agentic ML-specific reward module, which unifies varied ML feedback signals into consistent rewards for RL optimization. Leveraging this framework, we train ML-Agent, driven by a 7B-sized Qwen-2.5 LLM for autonomous ML. Remarkably, despite being trained on merely 9 ML tasks, our 7B-sized ML-Agent outperforms the 671B-sized DeepSeek-R1 agent. Furthermore, it achieves continuous performance improvements and demonstrates exceptional cross-task generalization capabilities.

摘要

基于大语言模型（LLM）的智能体出现显著推动了自主机器学习（ML）工程的发展。然而，现有方法大多严重依赖人工提示工程，无法根据多样化的实验经验进行自适应优化。针对这一问题，我们首次探索了基于学习的智能体化ML范式，即LLM智能体通过在线强化学习（RL）在ML任务上进行交互式实验来学习。为实现这一目标，我们提出了一种新型智能体化ML训练框架，包含三个关键组件：（1）探索增强的微调，使LLM智能体能生成多样化动作以增强RL探索；（2）分步RL，支持对单一动作步骤进行训练，加速经验收集并提升训练效率；（3）专为智能体化ML设计的奖励模块，将多样的ML反馈信号统一为适用于RL优化的连贯奖励。基于此框架，我们训练了由70亿参数Qwen-2.5 LLM驱动的ML-Agent用于自主ML。值得注意的是，尽管仅基于9个ML任务进行训练，我们的70亿参数ML-Agent性能超越了6710亿参数的DeepSeek-R1智能体。此外，该智能体实现了持续性能提升，并展现出卓越的跨任务泛化能力。

Bounded Rationality for LLMs: Satisficing Alignment at Inference-Time

Abstract

arXiv:2505.23729v1 Announce Type: cross Abstract: Aligning large language models with humans is challenging due to the inherently multifaceted nature of preference feedback. While existing approaches typically frame this as a multi-objective optimization problem, they often overlook how humans actually make decisions. Research on bounded rationality suggests that human decision making follows satisficing strategies: optimizing primary objectives while ensuring others meet acceptable thresholds. To bridge this gap and operationalize the notion of satisficing alignment, we propose SITAlign, an inference-time framework that addresses the multifaceted nature of alignment by maximizing a primary objective while satisfying threshold-based constraints on secondary criteria. We provide theoretical insights by deriving sub-optimality bounds of our satisficing-based inference alignment approach. We empirically validate SITAlign's performance through extensive experimentation on multiple benchmarks. For instance, on the PKU-SafeRLHF dataset with the primary objective of maximizing helpfulness while ensuring a threshold on harmlessness, SITAlign outperforms the state-of-the-art multi-objective decoding strategy by a margin of 22.3% in terms of GPT-4 win-tie rate for helpfulness reward while adhering to the threshold on harmlessness.

摘要

由于偏好反馈本质上具有多维度特性，将大语言模型与人类价值观对齐具有挑战性。现有方法通常将其视为多目标优化问题，但往往忽视了人类实际决策机制。有限理性研究表明，人类决策遵循满意化策略——在优化主要目标的同时确保其他目标达到可接受阈值。为弥合这一差距并实现满意化对齐，我们提出SITAlign：一种推理时框架，通过最大化主要目标并满足次要目标的阈值约束来处理对齐的多维度特性。我们从理论上推导了基于满意化推理对齐方法的次优性边界。通过在多个基准测试上的广泛实验，我们实证验证了SITAlign的性能。例如，在PKU-SafeRLHF数据集上以最大化帮助性为主要目标、同时确保无害性达到阈值时，SITAlign在保持无害性阈值的前提下，其帮助性奖励的GPT-4胜平率比当前最先进的多目标解码策略高出22.3%。
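
The satisficing rule in the abstract above (maximize a primary objective subject to thresholds on secondary criteria) can be illustrated as a selection over already-scored candidate responses. This is a simplification under assumptions: SITAlign is described as an inference-time framework whose actual decoding procedure is not given here, and the field names, scores, and threshold below are hypothetical.

```python
def satisficing_select(candidates, thresholds):
    """Pick the candidate with the highest primary score among those that
    meet every secondary threshold; return None if no candidate qualifies.

    candidates: list of dicts with 'text', 'primary', and 'secondary' (name -> score).
    thresholds: dict mapping a secondary criterion name to its minimum acceptable score.
    """
    feasible = [
        c for c in candidates
        if all(c["secondary"][name] >= t for name, t in thresholds.items())
    ]
    if not feasible:
        return None
    return max(feasible, key=lambda c: c["primary"])


# Hypothetical scored responses: helpfulness is primary, harmlessness is secondary.
responses = [
    {"text": "A", "primary": 0.9, "secondary": {"harmlessness": 0.2}},
    {"text": "B", "primary": 0.7, "secondary": {"harmlessness": 0.8}},
    {"text": "C", "primary": 0.5, "secondary": {"harmlessness": 0.9}},
]
chosen = satisficing_select(responses, {"harmlessness": 0.5})
```

Response A has the highest helpfulness but fails the harmlessness threshold, so the rule settles on B rather than trading the two scores off in a weighted sum.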

DeepTheorem: Advancing LLM Reasoning for Theorem Proving Through Natural Language and Reinforcement Learning

Abstract

arXiv:2505.23754v1 Announce Type: cross Abstract: Theorem proving serves as a major testbed for evaluating complex reasoning abilities in large language models (LLMs). However, traditional automated theorem proving (ATP) approaches rely heavily on formal proof systems that poorly align with LLMs' strength derived from informal, natural language knowledge acquired during pre-training. In this work, we propose DeepTheorem, a comprehensive informal theorem-proving framework exploiting natural language to enhance LLM mathematical reasoning. DeepTheorem includes a large-scale benchmark dataset consisting of 121K high-quality IMO-level informal theorems and proofs spanning diverse mathematical domains, rigorously annotated for correctness, difficulty, and topic categories, accompanied by systematically constructed verifiable theorem variants. We devise a novel reinforcement learning strategy (RL-Zero) explicitly tailored to informal theorem proving, leveraging the verified theorem variants to incentivize robust mathematical inference. Additionally, we propose comprehensive outcome and process evaluation metrics examining proof correctness and the quality of reasoning steps. Extensive experimental analyses demonstrate DeepTheorem significantly improves LLM theorem-proving performance compared to existing datasets and supervised fine-tuning protocols, achieving state-of-the-art accuracy and reasoning quality. Our findings highlight DeepTheorem's potential to fundamentally advance automated informal theorem proving and mathematical exploration.

摘要

定理证明是评估大语言模型（LLM）复杂推理能力的重要测试平台。然而，传统自动定理证明（ATP）方法过度依赖形式化证明系统，与LLM通过预训练获得的非形式化自然语言知识优势存在显著偏差。本研究提出DeepTheorem——一个利用自然语言增强LLM数学推理的综合性非形式化定理证明框架。该框架包含一个由12.1万条高质量IMO级别非形式化定理及证明构成的大规模基准数据集，涵盖多个数学领域，并严格标注了正确性、难度和主题分类，同时提供系统构建的可验证定理变体。我们设计了一种专门针对非形式化定理证明的新型强化学习策略（RL-Zero），通过已验证定理变体激励稳健的数学推理。此外，我们提出综合的结果与过程评估指标，用于检验证明正确性及推理步骤质量。大量实验分析表明，相较于现有数据集和监督微调方案，DeepTheorem显著提升了LLM的定理证明性能，在准确性和推理质量方面达到最先进水平。本研究揭示了DeepTheorem在推动非形式化自动定理证明和数学探索方面的根本性潜力。

ATLAS: Learning to Optimally Memorize the Context at Test Time

Abstract

arXiv:2505.23735v1 Announce Type: cross Abstract: Transformers have been established as the most popular backbones in sequence modeling, mainly due to their effectiveness in in-context retrieval tasks and the ability to learn at scale. Their quadratic memory and time complexity, however, limits their applicability to longer sequences and has motivated researchers to explore effective alternative architectures such as modern recurrent neural networks (a.k.a. long-term recurrent memory modules). Despite their recent success in diverse downstream tasks, these recurrent models struggle in tasks that require long-context understanding and extrapolation to longer sequences. We observe that these shortcomings come from three disjoint aspects of their design: (1) limited memory capacity that is bounded by the architecture of memory and feature mapping of the input; (2) online nature of update, i.e., optimizing the memory only with respect to the last input; and (3) less expressive management of their fixed-size memory. To improve all three aspects, we present ATLAS, a long-term memory module with high capacity that learns to memorize the context by optimizing the memory based on the current and past tokens, overcoming the online nature of long-term memory models. Building on this insight, we present a new family of Transformer-like architectures, called DeepTransformers, that are strict generalizations of the original Transformer architecture. Our experimental results on language modeling, common-sense reasoning, recall-intensive, and long-context understanding tasks show that ATLAS surpasses the performance of Transformers and recent linear recurrent models. ATLAS further improves the long-context performance of Titans, achieving +80% accuracy at 10M context length on the BABILong benchmark.

摘要

Transformer已成为序列建模中最流行的主干网络，主要得益于其在上下文检索任务中的高效性及规模化学习能力。然而，其二次方的内存与时间复杂度限制了其在长序列中的应用，这促使研究者探索现代循环神经网络（即长期循环记忆模块）等有效替代架构。尽管这些模型在多样下游任务中取得近期成功，但在需要长上下文理解及长序列外推的任务中仍存在不足。我们观察到这些缺陷源于其设计中的三个独立方面：（1）受限于内存架构与输入特征映射的有限记忆容量；（2）在线更新特性，即仅基于最后输入优化记忆；（3）固定大小内存的表达管理能力较弱。为全面提升这三个方面，我们提出ATLAS——一种高容量的长期记忆模块，其通过基于当前与历史标记优化记忆来学习上下文存储，克服了长期记忆模型的在线特性。基于此洞见，我们提出名为DeepTransformers的新型类Transformer架构家族，其严格推广了原始Transformer架构。在语言建模、常识推理、召回密集型及长上下文理解任务上的实验结果表明，ATLAS超越了Transformer及近期线性循环模型的性能。ATLAS进一步提升了Titans模型的长上下文性能，在BABILong基准测试的1000万上下文长度中实现了+80%的准确率提升。

From Chat Logs to Collective Insights: Aggregative Question Answering

Abstract

arXiv:2505.23765v1 Announce Type: cross Abstract: Conversational agents powered by large language models (LLMs) are rapidly becoming integral to our daily interactions, generating unprecedented amounts of conversational data. Such datasets offer a powerful lens into societal interests, trending topics, and collective concerns. Yet, existing approaches typically treat these interactions as independent and miss critical insights that could emerge from aggregating and reasoning across large-scale conversation logs. In this paper, we introduce Aggregative Question Answering, a novel task requiring models to reason explicitly over thousands of user-chatbot interactions to answer aggregative queries, such as identifying emerging concerns among specific demographics. To enable research in this direction, we construct a benchmark, WildChat-AQA, comprising 6,027 aggregative questions derived from 182,330 real-world chatbot conversations. Experiments show that existing methods either struggle to reason effectively or incur prohibitive computational costs, underscoring the need for new approaches capable of extracting collective insights from large-scale conversational data.

摘要

由大语言模型（LLMs）驱动的对话代理正迅速成为日常交互的重要组成部分，产生了前所未有的对话数据量。这类数据集为洞察社会兴趣、热点话题和集体关切提供了强大视角。然而，现有方法通常将这些交互视为独立事件，未能通过大规模对话日志的聚合与推理获取关键洞见。本文提出聚合式问答这一新任务，要求模型显式地对数千条用户-聊天机器人交互进行推理，以回答聚合查询（例如识别特定人群中出现的新关切）。为推进该方向研究，我们构建了WildChat-AQA基准数据集，包含从182,330条真实聊天机器人对话中提取的6,027个聚合问题。实验表明，现有方法要么难以有效推理，要么计算成本过高，这凸显了需要开发新方法以从大规模对话数据中提取集体洞见。

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Abstract

arXiv:2505.23747v1 Announce Type: cross Abstract: Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs, which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a spatial encoder, initialized from the backbone of the visual geometry model, to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct the Spatial-MLLM-120k dataset and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

摘要

多模态大语言模型（MLLMs）的最新进展显著提升了二维视觉任务性能，但其空间智能的提升仍面临挑战。现有三维MLLMs普遍依赖额外的3D或2.5D数据来实现空间感知，这限制了其在仅含二维输入（如图像或视频）场景中的应用。本文提出Spatial-MLLM——一个基于纯二维观测实现视觉空间推理的新型框架。与传统视频MLLMs采用面向语义理解的CLIP视觉编码器不同，我们的核心思路是释放前馈式视觉几何基础模型强大的结构先验。具体而言，我们设计双编码器架构：预训练二维视觉编码器提取语义特征，从视觉几何模型主干初始化的空间编码器提取三维结构特征，再通过连接器将两类特征融合为统一视觉标记以增强空间理解。此外，我们在推理阶段提出空间感知的帧采样策略，通过筛选视频序列中富含空间信息的帧，确保模型在有限标记长度下仍能聚焦于空间推理关键帧。除架构改进外，我们构建了Spatial-MLLM-120k数据集，采用监督微调和GRPO进行模型训练。在多个真实数据集上的实验表明，我们的Spatial-MLLM在各类视觉空间理解与推理任务中均达到最先进性能。项目页面：https://diankun-wu.github.io/Spatial-MLLM/。

Language Agents with Reinforcement Learning for Strategic Play in the Werewolf Game

Abstract

arXiv:2310.18940v4 Announce Type: replace Abstract: Agents built with large language models (LLMs) have shown great potential across a wide range of domains. However, in complex decision-making tasks, pure LLM-based agents tend to exhibit intrinsic bias in their choice of actions, which is inherited from the model's training data and results in suboptimal performance. To develop strategic language agents, i.e., agents that generate flexible language actions and possess strong decision-making abilities, we propose a novel framework that powers LLM-based agents with reinforcement learning (RL). We consider Werewolf, a popular social deduction game, as a challenging testbed that emphasizes versatile communication and strategic gameplay. To mitigate the intrinsic bias in language actions, our agents use an LLM to perform deductive reasoning and generate a diverse set of action candidates. Then an RL policy trained to optimize the decision-making ability chooses an action from the candidates to play in the game. Extensive experiments show that our agents overcome the intrinsic bias and outperform existing LLM-based agents in the Werewolf game. We also conduct human-agent experiments and find that our agents achieve human-level performance and demonstrate strong strategic play.

摘要

基于大语言模型（LLM）构建的智能体在众多领域展现出巨大潜力。然而在复杂决策任务中，纯LLM智能体往往表现出选择动作时的固有偏差，这种偏差源自模型训练数据并导致次优表现。为开发具有灵活语言生成能力和强大决策水平的战略语言智能体，我们提出了一种融合强化学习（RL）的新型LLM智能体框架。本研究以强调多样化沟通与战略博弈的社交推理游戏"狼人杀"作为挑战性测试平台。为消除语言动作中的固有偏差，我们的智能体利用LLM进行演绎推理并生成多样化候选动作集，随后由经过决策能力优化的RL策略从候选动作中选择游戏行为。大量实验表明，我们的智能体成功克服了固有偏差，在狼人杀游戏中表现优于现有LLM智能体。人机对抗实验进一步证实，该智能体达到了人类水平并展现出强大的战略博弈能力。

Large Language Model Driven Recommendation

Abstract

arXiv:2408.10946v2 Announce Type: replace Abstract: While previous chapters focused on recommendation systems (RSs) based on standardized, non-verbal user feedback such as purchases, views, and clicks, the advent of LLMs has unlocked the use of natural language (NL) interactions for recommendation. This chapter discusses how LLMs' abilities for general NL reasoning present novel opportunities to build highly personalized RSs, which can effectively connect nuanced and diverse user preferences to items, potentially via interactive dialogues. To begin this discussion, we first present a taxonomy of the key data sources for language-driven recommendation, covering item descriptions, user-system interactions, and user profiles. We then proceed to fundamental techniques for LLM recommendation, reviewing the use of encoder-only and autoregressive LLM recommendation in both tuned and untuned settings. Afterwards, we move to multi-module recommendation architectures in which LLMs interact with components such as retrievers and RSs in multi-stage pipelines. This brings us to architectures for conversational recommender systems (CRSs), in which LLMs facilitate multi-turn dialogues where each turn presents an opportunity not only to make recommendations, but also to engage with the user in interactive preference elicitation, critiquing, and question-answering.

OmniBal: Towards Fast Instruction-Tuning for Vision-Language Models via Omniverse Computation Balance

Abstract

arXiv:2407.20761v4 Announce Type: replace Abstract: Vision-language instruction-tuning models have recently achieved significant performance improvements. In this work, we discover that large-scale 3D parallel training on those models leads to an imbalanced computation load across different devices. The vision and language parts are inherently heterogeneous: their data distribution and model architecture differ significantly, which affects distributed training efficiency. To address this issue, we rebalance the computational load from data, model, and memory perspectives, achieving more balanced computation across devices. Specifically, for the data, instances are grouped into new balanced mini-batches within and across devices. A search-based method is employed for the model to achieve a more balanced partitioning. For memory optimization, we adaptively adjust the re-computation strategy for each partition to utilize the available memory fully. These three perspectives are not independent but are closely connected, forming an omniverse balanced training framework. Extensive experiments are conducted to validate the effectiveness of our method. Compared with the open-source training code of InternVL-Chat, training time is reduced greatly, achieving about 1.8 $\times$ speed-up. Our method's efficacy and generalizability are further validated across various models and datasets. Codes will be released at https://github.com/ModelTC/OmniBal.

摘要

视觉语言指令调优模型近期取得了显著的性能提升。本研究发现，对这些模型进行大规模三维并行训练会导致不同设备间计算负载不均衡。视觉与语言模块存在固有异构性：二者数据分布与模型架构差异显著，这影响了分布式训练效率。为解决该问题，我们从数据、模型和内存三个维度重新平衡计算负载，实现设备间更均衡的计算分配。具体而言，在数据层面，将样本在设备内与跨设备重组为平衡的微批次；模型层面采用基于搜索的方法实现更均衡的划分；内存优化层面则自适应调整各分区的重计算策略以充分利用可用内存。这三个维度并非独立，而是相互关联形成全维度平衡训练框架。通过大量实验验证了方法的有效性：相比InternVL-Chat开源训练代码，训练时间大幅缩短，实现约1.8倍加速。本方法的有效性和泛化能力在多种模型与数据集上得到进一步验证。代码将在https://github.com/ModelTC/OmniBal发布。
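
To make the data-side rebalancing described above concrete, here is a minimal sketch. The abstract only states that instances are regrouped into balanced mini-batches; the greedy longest-first heuristic below is a stand-in chosen for illustration, not the paper's algorithm, and the names `balance_batches` and `group_loads` are hypothetical. Each sample is represented only by a cost such as its token count.

```python
import heapq


def balance_batches(lengths, num_groups):
    """Assign samples to groups so that total cost per group is roughly equal.

    Greedy longest-first heuristic (an illustrative assumption): visit samples
    from most to least expensive and always give the next one to the group
    with the smallest load so far.
    """
    heap = [(0, g) for g in range(num_groups)]  # (current load, group index)
    heapq.heapify(heap)
    groups = [[] for _ in range(num_groups)]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for idx in order:
        load, g = heapq.heappop(heap)
        groups[g].append(idx)
        heapq.heappush(heap, (load + lengths[idx], g))
    return groups


def group_loads(lengths, groups):
    """Total cost of each group, used to check how balanced the result is."""
    return [sum(lengths[i] for i in grp) for grp in groups]


if __name__ == "__main__":
    token_counts = [900, 100, 850, 150, 500, 500]  # made-up sample lengths
    groups = balance_batches(token_counts, num_groups=2)
    print(groups, group_loads(token_counts, groups))
```

With naive in-order batching, these six samples split into loads of 1850 and 1150; the greedy regrouping evens them out, which is the kind of imbalance the abstract attributes to heterogeneous vision and language inputs.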

Leveraging Large Language Models for Relevance Judgments in Legal Case Retrieval

Abstract

arXiv:2403.18405v2 Announce Type: replace Abstract: Determining which legal cases are relevant to a given query involves navigating lengthy texts and applying nuanced legal reasoning. Traditionally, this task has demanded significant time and domain expertise to identify key Legal Facts and reach sound juridical conclusions. In addition, existing data with legal case similarities often lack interpretability, making it difficult to understand the rationale behind relevance judgments. With the growing capabilities of large language models (LLMs), researchers have begun investigating their potential in this domain. Nonetheless, the method of employing a general large language model for reliable relevance judgments in legal case retrieval remains largely unexplored. To address this gap in research, we propose a novel few-shot approach where LLMs assist in generating expert-aligned interpretable relevance judgments. The proposed approach decomposes the judgment process into several stages, mimicking the workflow of human annotators and allowing for the flexible incorporation of expert reasoning to improve the accuracy of relevance judgments. Importantly, it also ensures interpretable data labeling, providing transparency and clarity in the relevance assessment process. Through a comparison of relevance judgments made by LLMs and human experts, we empirically demonstrate that the proposed approach can yield reliable and valid relevance assessments. Furthermore, we demonstrate that with minimal expert supervision, our approach enables a large language model to acquire case analysis expertise and subsequently transfers this ability to a smaller model via annotation-based knowledge distillation.

摘要

判定哪些法律案例与给定查询相关，需要处理冗长的文本并运用细致的法律推理。传统上，该任务需要耗费大量时间并依赖领域专业知识，以识别关键法律事实并得出合理的司法结论。此外，现有的法律案例相似性数据往往缺乏可解释性，导致难以理解相关性判断的依据。随着大语言模型（LLMs）能力的不断提升，研究者开始探索其在该领域的潜力。然而，如何利用通用大语言模型进行法律案例检索中可靠的相关性判断，目前仍鲜有研究。为填补这一空白，我们提出了一种新颖的小样本方法，通过大语言模型辅助生成与专家意见一致的可解释相关性判断。该方法将判断过程分解为多个阶段，模拟人类标注者的工作流程，并灵活融入专家推理以提高相关性判断的准确性。重要的是，该方法还确保了数据标注的可解释性，为相关性评估过程提供了透明度和清晰性。通过对比大语言模型与人类专家的相关性判断，我们实证表明所提方法能够产生可靠且有效的相关性评估。此外，我们还证明，在极少的专家监督下，该方法能使大语言模型获得案例分析的专业能力，进而通过基于标注的知识蒸馏将这种能力迁移至较小模型。

Differential Information: An Information-Theoretic Perspective on Preference Optimization

Abstract

arXiv:2505.23761v1 Announce Type: cross Abstract: Direct Preference Optimization (DPO) has become a standard technique for aligning language models with human preferences in a supervised manner. Despite its empirical success, the theoretical justification behind its log-ratio reward parameterization remains incomplete. In this work, we address this gap by utilizing the Differential Information Distribution (DID): a distribution over token sequences that captures the information gained during policy updates. First, we show that when preference labels encode the differential information required to transform a reference policy into a target policy, the log-ratio reward in DPO emerges as the uniquely optimal form for learning the target policy via preference optimization. This result naturally yields a closed-form expression for the optimal sampling distribution over rejected responses. Second, we find that the condition for preferences to encode differential information is fundamentally linked to an implicit assumption regarding log-margin ordered policies, an inductive bias widely used in preference optimization yet previously unrecognized. Finally, by analyzing the entropy of the DID, we characterize how learning low-entropy differential information reinforces the policy distribution, while high-entropy differential information induces a smoothing effect, which explains the log-likelihood displacement phenomenon. We validate our theoretical findings in synthetic experiments and extend them to real-world instruction-following datasets. Our results suggest that learning high-entropy differential information is crucial for general instruction-following, while learning low-entropy differential information benefits knowledge-intensive question answering. Overall, our work presents a unifying perspective on the DPO objective, the structure of preference data, and resulting policy behaviors through the lens of differential information.

摘要

直接偏好优化（DPO）已成为通过监督方式使语言模型与人类偏好对齐的标准技术。尽管其经验上取得了成功，但其对数比奖励参数化背后的理论依据仍不完善。在本研究中，我们通过利用差分信息分布（DID）——一种捕获策略更新过程中所获信息的令牌序列分布——来解决这一空白。首先，我们证明当偏好标签编码了将参考策略转化为目标策略所需的差分信息时，DPO中的对数比奖励作为通过偏好优化学习目标策略的唯一最优形式出现。这一结果自然推导出被拒绝响应最优采样分布的闭式表达式。其次，我们发现偏好编码差分信息的条件与对数边际有序策略的隐含假设存在根本联系——这是偏好优化中广泛使用却未被认识的归纳偏置。最后，通过分析DID的熵，我们刻画了学习低熵差分信息如何强化策略分布，而高熵差分信息则产生平滑效应，从而解释了对数似然位移现象。我们在合成实验中验证了理论发现，并将其扩展到现实世界指令跟随数据集。结果表明，学习高熵差分信息对通用指令跟随至关重要，而学习低熵差分信息则有益于知识密集型问答。总体而言，我们的工作通过差分信息视角，为DPO目标、偏好数据结构及由此产生的策略行为提供了统一的理论框架。
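
For readers unfamiliar with the "log-ratio reward parameterization" this abstract refers to, the standard DPO formulation is recalled below. The abstract itself gives no notation, so the symbols follow the usual DPO convention and are assumptions here: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the reference policy, $\beta > 0$ a temperature, $y_w$ and $y_l$ the chosen and rejected responses to prompt $x$, and $\sigma$ the logistic function.

```latex
% Implicit reward in DPO: a scaled log-ratio between policy and reference
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% DPO objective: logistic loss on the reward margin between chosen and rejected responses
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
    \left[ \log \sigma\!\left( r_\theta(x, y_w) - r_\theta(x, y_l) \right) \right]
```

The paper's claim concerns when this particular form of $r_\theta$ is the uniquely optimal choice; the Differential Information Distribution itself is not defined in the abstract and is therefore not reproduced here.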

MegaAgent: A Large-Scale Autonomous LLM-based Multi-Agent System Without Predefined SOPs

Abstract

arXiv:2408.09955v3 Announce Type: replace Abstract: LLM-based multi-agent systems (MAS) have shown promise in tackling complex tasks. However, existing solutions often suffer from limited agent coordination and heavy reliance on predefined Standard Operating Procedures (SOPs), which demand extensive human input. To address these limitations, we propose MegaAgent, a large-scale autonomous LLM-based multi-agent system. MegaAgent generates agents based on task complexity and enables dynamic task decomposition, parallel execution, efficient communication, and comprehensive system monitoring of agents. In evaluations, MegaAgent demonstrates exceptional performance, successfully developing a Gobang game within 800 seconds and scaling up to 590 agents in a national policy simulation to generate multi-domain policies. It significantly outperforms existing systems, such as MetaGPT, in both task completion efficiency and scalability. By eliminating the need for predefined SOPs, MegaAgent demonstrates exceptional scalability and autonomy, setting a foundation for advancing true autonomy in MAS. Our code is available at https://github.com/Xtra-Computing/MegaAgent .

摘要

基于大语言模型（LLM）的多智能体系统（MAS）在解决复杂任务方面展现出潜力。然而现有方案普遍存在智能体协作受限、过度依赖预定义标准操作流程（SOP）等问题，这些缺陷需要大量人工干预。为突破这些限制，我们提出MegaAgent——一个基于LLM的大规模自主多智能体系统。该系统能根据任务复杂度生成智能体，实现动态任务分解、并行执行、高效通信及全面的智能体系统监控。评估结果表明，MegaAgent展现出卓越性能：在800秒内成功开发五子棋游戏，并在国家政策模拟中扩展至590个智能体以生成多领域政策。其在任务完成效率和可扩展性方面显著优于MetaGPT等现有系统。通过消除对预定义SOP的依赖，MegaAgent展现出非凡的可扩展性和自主性，为推进MAS真正自主性奠定基础。代码已开源：https://github.com/Xtra-Computing/MegaAgent。

On the Resilience of LLM-Based Multi-Agent Collaboration with Faulty Agents

Abstract

arXiv:2408.00989v4 Announce Type: replace Abstract: Large language model-based multi-agent systems have shown great abilities across various tasks due to the collaboration of expert agents, each focusing on a specific domain. However, the impact of clumsy or even malicious agents (those who frequently make errors in their tasks) on the overall performance of the system remains underexplored. This paper investigates: (1) What is the resilience of various system structures (e.g., A $\rightarrow$ B $\rightarrow$ C, A $\leftrightarrow$ B $\leftrightarrow$ C) under faulty agents, on different downstream tasks? (2) How can we increase system resilience to defend against these agents? To simulate faulty agents, we propose two approaches, AutoTransform and AutoInject, which introduce mistakes into the agents' responses. Experiments on four downstream tasks using six systems show that the "hierarchical" structure, i.e., A $\rightarrow$ (B $\leftrightarrow$ C), exhibits superior resilience with the lowest performance drop of 5.5%, compared to 10.5% and 23.7% for the other two structures. To further improve resilience, we introduce (1) Challenger, which gives each agent a mechanism to challenge others' outputs, and (2) Inspector, an additional agent that reviews and corrects messages, recovering up to 96.4% of the errors made by faulty agents. Our code and data are available at https://github.com/CUHK-ARISE/MAS-Resilience.

摘要

基于大语言模型的多智能体系统凭借各领域专家代理的协作，在各类任务中展现出卓越能力。然而，频繁出现任务错误的笨拙甚至恶意代理对系统整体性能的影响尚未得到充分研究。本文探究：(1)不同系统结构（如A→B→C、A↔B↔C）在存在故障代理时，对下游任务的抗干扰能力如何？(2)如何提升系统防御此类代理的韧性？我们提出AutoTransform和AutoInject两种方法模拟故障代理，通过向代理响应注入错误进行实验。在四个下游任务上对六种系统的测试表明，'分层'结构（A→(B↔C)）表现出最优韧性，性能仅下降5.5%，显著优于其他两种结构的10.5%和23.7%。为增强系统韧性，我们提出：(1)挑战者机制——使每个代理能质疑其他代理的输出；(2)审查者代理——新增代理负责复核修正消息，可修复故障代理96.4%的错误。代码与数据详见https://github.com/CUHK-ARISE/MAS-Resilience。

Abstract

arXiv:2410.08948v2 Announce Type: replace Abstract: Social conventions are the backbone of social coordination, shaping how individuals form a group. As growing populations of artificial intelligence (AI) agents communicate through natural language, a fundamental question is whether they can bootstrap the foundations of a society. Here, we present experimental results that demonstrate the spontaneous emergence of universally adopted social conventions in decentralized populations of large language model (LLM) agents. We then show how strong collective biases can emerge during this process, even when agents exhibit no bias individually. Last, we examine how committed minority groups of adversarial LLM agents can drive social change by imposing alternative social conventions on the larger population. Our results show that AI systems can autonomously develop social conventions without explicit programming and have implications for designing AI systems that align, and remain aligned, with human values and societal goals.

摘要

社会惯例是社会协调的支柱，决定着个体如何形成群体。随着越来越多的人工智能（AI）代理通过自然语言进行交流，一个根本性问题在于它们能否构建社会的基础。本文通过实验证明，在去中心化的大型语言模型（LLM）代理群体中会自发形成普遍采纳的社会惯例。我们进一步揭示，即使个体代理未表现出任何偏见，这一过程中仍可能产生强烈的集体偏见。最后，我们探讨了具有对抗性的LLM代理少数群体如何通过向主流群体强加替代性社会惯例来推动社会变革。研究结果表明，AI系统无需显式编程即可自主发展社会惯例，这对设计符合并持续符合人类价值观与社会目标的AI系统具有重要启示。

Towards Logically Sound Natural Language Reasoning with Logic-Enhanced Language Model Agents

Abstract

arXiv:2408.16081v2 Announce Type: replace Abstract: Large language models (LLMs) are increasingly explored as general-purpose reasoners, particularly in agentic contexts. However, their outputs remain prone to mathematical and logical errors. This is especially challenging in open-ended tasks, where unstructured outputs lack explicit ground truth and may contain subtle inconsistencies. To address this issue, we propose Logic-Enhanced Language Model Agents (LELMA), a framework that integrates LLMs with formal logic to enable validation and refinement of natural language reasoning. LELMA comprises three components: an LLM-Reasoner, an LLM-Translator, and a Solver, and employs autoformalization to translate reasoning into logic representations, which are then used to assess logical validity. Using game-theoretic scenarios such as the Prisoner's Dilemma as testbeds, we highlight the limitations of both less capable (Gemini 1.0 Pro) and advanced (GPT-4o) models in generating logically sound reasoning. LELMA achieves high accuracy in error detection and improves reasoning correctness via self-refinement, particularly in GPT-4o. The study also highlights challenges in autoformalization accuracy and in evaluation of inherently ambiguous open-ended reasoning tasks.

摘要

大型语言模型（LLMs）正日益被探索作为通用推理工具，尤其在智能体应用场景中。然而，其输出仍易出现数学与逻辑错误。这一问题在开放式任务中尤为突出，因为非结构化输出缺乏明确基准真值，且可能包含微妙的逻辑不一致。为解决该问题，我们提出逻辑增强语言模型智能体（LELMA），该框架通过将LLMs与形式逻辑相结合，实现对自然语言推理的验证与优化。LELMA包含三个组件：LLM推理器、LLM翻译器和逻辑求解器，采用自动形式化技术将推理过程转化为逻辑表示，进而评估其逻辑有效性。以囚徒困境等博弈论场景作为测试平台，我们揭示了能力较弱模型（Gemini 1.0 Pro）与先进模型（GPT-4o）在生成逻辑严密推理方面的局限性。LELMA在错误检测方面表现出高准确率，并通过自我优化显著提升了GPT-4o的推理正确性。研究同时揭示了自动形式化准确性及对固有模糊性开放式推理任务评估方面存在的挑战。

CodePMP: Scalable Preference Model Pretraining for Large Language Model Reasoning

Abstract

arXiv:2410.02229v2 Announce Type: replace Abstract: Large language models (LLMs) have made significant progress in natural language understanding and generation, driven by scalable pretraining and advanced finetuning. However, enhancing reasoning abilities in LLMs, particularly via reinforcement learning from human feedback (RLHF), remains challenging due to the scarcity of high-quality preference data, which is labor-intensive to annotate and crucial for reward model (RM) finetuning. To alleviate this issue, we introduce CodePMP, a scalable preference model pretraining (PMP) pipeline that utilizes a large corpus of synthesized code-preference pairs from publicly available high-quality source code. CodePMP improves RM finetuning efficiency by pretraining preference models on large-scale synthesized code-preference pairs. We evaluate CodePMP on mathematical reasoning tasks (GSM8K, MATH) and logical reasoning tasks (ReClor, LogiQA2.0), consistently showing significant improvements in reasoning performance of LLMs and highlighting the importance of scalable preference model pretraining for efficient reward modeling.

摘要

大型语言模型（LLMs）通过可扩展的预训练和先进的微调技术，在自然语言理解与生成领域取得了显著进展。然而，由于高质量偏好数据的稀缺性——这类数据不仅标注成本高昂，且对奖励模型（RM）微调至关重要——如何通过人类反馈强化学习（RLHF）提升LLMs的推理能力仍具挑战性。为此，我们提出CodePMP：一种基于公开高质量源代码构建的大规模合成代码-偏好对的偏好模型预训练（PMP）流程。该方法通过在大规模合成代码-偏好对上预训练偏好模型，显著提升了奖励模型的微调效率。我们在数学推理任务（GSM8K、MATH）和逻辑推理任务（ReClor、LogiQA2.0）上评估CodePMP，结果表明其能持续提升LLMs的推理性能，这凸显了可扩展偏好模型预训练对高效奖励建模的重要性。

GIVE: Structured Reasoning of Large Language Models with Knowledge Graph Inspired Veracity Extrapolation

Abstract

arXiv:2410.08475v3 Announce Type: replace Abstract: Existing approaches based on context prompting or reinforcement learning (RL) to improve the reasoning capacities of large language models (LLMs) depend on the LLMs' internal knowledge to produce reliable Chain-Of-Thought (CoT). However, no matter the size of LLMs, certain problems cannot be resolved in a single forward pass. Meanwhile, agent-based reasoning systems require access to a comprehensive nonparametric knowledge base, which is often costly or not feasible for use in scientific and niche domains. We present Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning method that merges parametric and non-parametric memories to improve accurate reasoning with minimal external input. GIVE guides the LLM agent to select the most pertinent expert data (observe), engage in query-specific divergent thinking (reflect), and then synthesize this information to produce the final output (speak). Extensive experiments demonstrated the following benefits of our framework: (1) GIVE boosts the performance of LLMs across various sizes. (2) In some scenarios, GIVE allows smaller LLMs to surpass larger, more sophisticated ones in scientific tasks (GPT3.5T + GIVE > GPT4). (3) GIVE is effective on scientific and open-domain assessments. (4) GIVE is a training-free method that enables LLMs to tackle new problems that extend beyond their training data (accuracy improvement of up to 43.5% -> 88.2%). (5) GIVE allows LLM agents to reason using both restricted (very small) and noisy (very large) knowledge sources, accommodating knowledge graphs (KG) ranging from 135 to more than 840k nodes. (6) The reasoning process involved in GIVE is fully interpretable.

摘要

现有基于上下文提示或强化学习（RL）的方法旨在提升大语言模型（LLMs）的推理能力，但其依赖模型内部知识生成可靠思维链（CoT）。然而无论模型规模大小，某些问题无法通过单次前向推理解决。同时，基于智能体的推理系统需要访问全面的非参数化知识库，这在科学和垂直领域往往成本高昂或难以实现。我们提出图启发真实性外推法（GIVE），这是一种融合参数化与非参数化记忆的新型推理方法，能以最小外部输入提升准确推理能力。GIVE引导LLM智能体选择最相关的专家数据（观察阶段），进行查询导向的发散思考（反思阶段），最终整合信息生成输出（表达阶段）。大量实验表明本框架具有以下优势：（1）GIVE能提升不同规模LLMs的性能；（2）某些科学任务中，GIVE可使较小LLMs超越更复杂的大模型（GPT3.5T+GIVE>GPT4）；（3）GIVE在科学和开放领域评估中均有效；（4）作为无需训练的方法，GIVE使LLMs能处理超出训练数据范围的新问题（准确率提升达43.5%→88.2%）；（5）GIVE支持LLM智能体利用受限（极小）和噪声（极大）知识源进行推理，适配135至84万节点的知识图谱（KG）；（6）GIVE的推理过程完全可解释。

Generative Agents for Multi-Agent Autoformalization of Interaction Scenarios

Abstract

arXiv:2412.08805v3 Announce Type: replace Abstract: Multi-agent simulations are versatile tools for exploring interactions among natural and artificial agents, but their development typically demands domain expertise and manual effort. This work introduces the Generative Agents for Multi-Agent Autoformalization (GAMA) framework, which automates the formalization of interaction scenarios in simulations using agents augmented with large language models (LLMs). To demonstrate the application of GAMA, we use natural language descriptions of game-theoretic scenarios representing social interactions, and we autoformalize them into executable logic programs defining game rules, with syntactic correctness enforced through a solver-based validation. To ensure runtime validity, an iterative, tournament-based procedure tests the generated rules and strategies, followed by exact semantic validation when ground truth outcomes are available. In experiments with 110 natural language descriptions across five 2x2 simultaneous-move games, GAMA achieves 100% syntactic and 76.5% semantic correctness with Claude 3.5 Sonnet, and 99.82% syntactic and 77% semantic correctness with GPT-4o. The framework also shows high semantic accuracy in autoformalizing agents' strategies.

摘要

多智能体仿真是探索自然与人工智能体间交互的通用工具，但其开发通常需要领域专业知识与人工投入。本研究提出用于多智能体自动形式化的生成式智能体框架（GAMA），该框架通过增强大型语言模型（LLMs）的智能体，实现了仿真中交互场景的自动化形式化。为展示GAMA的应用，我们使用描述社会交互的博弈论场景自然语言文本，将其自动形式化为定义游戏规则的可执行逻辑程序，并通过基于求解器的验证确保语法正确性。为保证运行时有效性，采用基于锦标赛的迭代流程测试生成规则与策略，在存在真实结果时进行精确语义验证。在涵盖五种2x2同步移动游戏的110个自然语言描述实验中，GAMA使用Claude 3.5 Sonnet实现100%语法正确率与76.5%语义正确率，使用GPT-4o实现99.82%语法正确率与77%语义正确率。该框架在智能体策略自动形式化方面亦展现出较高的语义准确性。

Dataset Featurization: Uncovering Natural Language Features through Unsupervised Data Reconstruction

Abstract

arXiv:2502.17541v2 Announce Type: replace Abstract: Interpreting data is central to modern research. Large language models (LLMs) show promise in providing such natural language interpretations of data, yet simple feature extraction methods such as prompting often fail to produce accurate and versatile descriptions for diverse datasets and lack control over granularity and scale. To address these limitations, we propose a domain-agnostic method for dataset featurization that provides precise control over the number of features extracted while maintaining compact and descriptive representations comparable to human labeling. Our method optimizes the selection of informative binary features by evaluating the ability of an LLM to reconstruct the original data using those features. We demonstrate its effectiveness in dataset modeling tasks and through two case studies: (1) Constructing a feature representation of jailbreak tactics that compactly captures both the effectiveness and diversity of a larger set of human-crafted attacks; and (2) automating the discovery of features that align with human preferences, achieving accuracy and robustness comparable to human-crafted features. Moreover, we show that the pipeline scales effectively, improving as additional features are sampled, making it suitable for large and diverse datasets.

摘要

数据解释是现代研究的核心任务。大型语言模型（LLMs）在提供数据自然语言解释方面展现出潜力，但简单的特征提取方法（如提示）往往无法为多样化数据集生成准确且通用的描述，且难以控制粒度和规模。针对这些局限性，我们提出一种领域无关的数据集特征化方法，可在保持与人工标注相当的紧凑性和描述性的同时，精确控制提取特征的数量。该方法通过评估LLM利用选定特征重构原始数据的能力，优化信息性二元特征的选择。我们通过数据集建模任务和两个案例研究验证其有效性：（1）构建越狱策略的特征表示，紧凑地捕捉大规模人工构造攻击的有效性和多样性；（2）自动化发现符合人类偏好的特征，其准确性和鲁棒性达到人工构造特征水平。此外，我们证明该管道具备良好的可扩展性，随着特征采样数量的增加而持续改进，适用于大规模多样化数据集。

ReAgent: Reversible Multi-Agent Reasoning for Knowledge-Enhanced Multi-Hop QA

Abstract

arXiv:2503.06951v2 Announce Type: replace Abstract: Recent advances in large language models (LLMs) have significantly improved multi-hop question answering (QA) through direct Chain-of-Thought (CoT) reasoning. However, the irreversible nature of CoT leads to error accumulation, making it challenging to correct mistakes in multi-hop reasoning. This paper introduces ReAgent: a Reversible multi-Agent collaborative framework augmented with explicit backtracking mechanisms, enabling reversible multi-hop reasoning. By incorporating text-based retrieval, information aggregation and validation, our system can detect and correct errors mid-reasoning, leading to more robust and interpretable QA outcomes. The framework and experiments serve as a foundation for future work on error-tolerant QA systems. Empirical evaluations across three benchmarks indicate ReAgent's efficacy, yielding an average improvement of about 6% over baseline models.

摘要

大型语言模型（LLM）的最新进展通过直接思维链（CoT）推理显著提升了多跳问答（QA）性能。然而，CoT的不可逆性会导致错误累积，使得多跳推理中的错误难以纠正。本文提出ReAgent：一种可逆的多智能体协作框架，通过显式回溯机制增强，实现可逆的多跳推理。该系统通过基于文本的检索、信息聚合与验证，能够在推理过程中检测并纠正错误，从而获得更鲁棒且可解释的QA结果。该框架及实验为未来容错QA系统的研究奠定了基础。在三个基准测试上的实证评估表明，ReAgent具有显著效果，相较基线模型平均提升约6%。

BYOS: Knowledge-driven Large Language Models Bring Your Own Operating System More Excellent

Abstract

arXiv:2503.09663v2 Announce Type: replace Abstract: Operating System (OS) kernel tuning involves systematically adjusting kernel configurations to optimize system performance. Despite recent advancements in large language models (LLMs), kernel tuning remains a critical challenge due to: (1) the semantic gap between abstract tuning objectives and concrete config options, (2) LLM hallucinations induced by insufficient environmental interaction, and (3) the rapid evolution of kernel versions. To address these challenges, we propose BYOS, an LLM-powered framework that automates kernel tuning through three key innovations: structured knowledge construction and mapping, knowledge-driven configuration generation, and continuous knowledge maintenance. Extensive experiments show that BYOS achieves 7.1%-155.4% performance improvements over default configurations across standard OS benchmarks and real-world applications, demonstrating that structured knowledge representation can overcome key limitations of pure LLM solutions for system optimization. Our code is available at https://github.com/LHY-24/BYOS.

摘要

操作系统内核调优涉及通过系统性调整内核配置以优化系统性能。尽管大型语言模型（LLMs）近期取得进展，内核调优仍面临关键挑战：(1) 抽象调优目标与具体配置选项间的语义鸿沟，(2) 环境交互不足导致LLM产生幻觉输出，(3) 内核版本快速迭代。为解决这些问题，我们提出BYOS框架，该LLM驱动系统通过三大创新实现自动化调优：结构化知识构建与映射、知识驱动的配置生成、持续知识维护。大量实验表明，在标准操作系统基准测试和实际应用中，BYOS相较默认配置实现7.1%-155.4%的性能提升，证明结构化知识表征能克服纯LLM方案在系统优化中的关键局限。代码已开源：https://github.com/LHY-24/BYOS。

am-ELO: A Stable Framework for Arena-based LLM Evaluation

Abstract

arXiv:2505.03475v2 Announce Type: replace Abstract: Arena-based evaluation is a fundamental yet significant evaluation paradigm for modern AI models, especially large language models (LLMs). The existing framework based on the ELO rating system suffers from an inevitable instability problem due to ranking inconsistency and the lack of attention to the varying abilities of annotators. In this paper, we introduce a novel stable arena framework that addresses these issues by enhancing the ELO rating system. Specifically, we replace the iterative update method with a Maximum Likelihood Estimation (MLE) approach, m-ELO, and provide theoretical proof of the consistency and stability of the MLE approach for model ranking. Additionally, we propose am-ELO, which modifies the ELO rating's probability function to incorporate annotator abilities, enabling the simultaneous estimation of model scores and annotator reliability. Experiments demonstrate that this method ensures stability, showing that the framework offers a more robust, accurate, and stable evaluation method for LLMs.

摘要

基于竞技场的评估是现代人工智能模型（尤其是大语言模型）的一项基础而重要的评估范式。现有基于ELO评分体系的框架因排名不一致性及忽视标注者能力差异，存在不可避免的稳定性问题。本文提出一种新型稳定竞技场框架，通过改进ELO评分系统来解决这些问题。具体而言，我们采用最大似然估计方法（m-ELO）替代迭代更新方式，并从理论上证明了该方法的模型排名一致性与稳定性。此外，我们提出am-ELO方法，通过修正ELO评分的概率函数来纳入标注者能力因素，实现模型得分与标注者可靠性的同步估计。实验表明该方法能确保稳定性，证明该框架为大语言模型提供了更鲁棒、精确且稳定的评估方法。
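
The contrast between iterative ELO updates and a maximum-likelihood fit can be sketched in a few lines. This is a generic pairwise-comparison MLE written for illustration, not the paper's m-ELO or am-ELO code: it assumes a plain logistic win probability on a natural-log scale, fits all ratings jointly by gradient ascent, and leaves out the annotator-ability term. The function name `fit_mle_ratings` and its parameters are hypothetical.

```python
import math


def fit_mle_ratings(num_models, matches, steps=2000, lr=0.5):
    """Fit ratings by maximizing the likelihood of all match outcomes at once.

    matches: list of (winner_index, loser_index) pairs.
    Assumed model: P(i beats j) = 1 / (1 + exp(-(r_i - r_j))).
    Ratings are re-centered to mean zero each step, since only rating
    differences are identifiable.
    """
    ratings = [0.0] * num_models
    for _ in range(steps):
        grad = [0.0] * num_models
        for winner, loser in matches:
            p_win = 1.0 / (1.0 + math.exp(-(ratings[winner] - ratings[loser])))
            # Gradient of the log-likelihood for one observed win
            grad[winner] += 1.0 - p_win
            grad[loser] -= 1.0 - p_win
        ratings = [r + lr * g / len(matches) for r, g in zip(ratings, grad)]
        mean = sum(ratings) / num_models
        ratings = [r - mean for r in ratings]
    return ratings


if __name__ == "__main__":
    # Made-up outcomes: model 0 beats model 1 three times out of four,
    # and model 1 beats model 2 three times out of four.
    matches = [(0, 1)] * 3 + [(1, 0)] + [(1, 2)] * 3 + [(2, 1)]
    print(fit_mle_ratings(3, matches))
```

Because the likelihood is a sum over matches, the fitted ratings do not depend on the order in which matches are listed, whereas an iterative ELO update generally gives different ratings for different orderings of the same results. This order sensitivity is one common source of the instability the abstract mentions.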

HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking

Abstract

arXiv:2505.02322v2 Announce Type: replace Abstract: Recent advancements have significantly enhanced the performance of large language models (LLMs) in tackling complex reasoning tasks, achieving notable success in domains like mathematical and logical reasoning. However, these methods encounter challenges with complex planning tasks, primarily due to extended reasoning steps, diverse constraints, and the challenge of handling multiple distinct sub-tasks. To address these challenges, we propose HyperTree Planning (HTP), a novel reasoning paradigm that constructs hypertree-structured planning outlines for effective planning. The hypertree structure enables LLMs to engage in hierarchical thinking by flexibly employing the divide-and-conquer strategy, effectively breaking down intricate reasoning steps, accommodating diverse constraints, and managing multiple distinct sub-tasks in a well-organized manner. We further introduce an autonomous planning framework that completes the planning process by iteratively refining and expanding the hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, achieving state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, resulting in a 3.6 times performance improvement over o1-preview.

摘要

近期研究显著提升了大型语言模型（LLMs）在复杂推理任务中的表现，尤其在数学与逻辑推理领域取得显著成果。然而，这些方法在复杂规划任务中仍面临挑战，主要源于推理步骤冗长、约束条件多样以及多子任务协同处理的困难。为此，我们提出超树规划（HTP）——一种通过构建超树结构规划纲要来实现高效推理的新范式。该结构使LLMs能够灵活运用分治策略进行层次化思考，从而有效分解复杂推理步骤、适配多样化约束条件，并以系统化方式处理多个独立子任务。我们进一步提出自主规划框架，通过迭代优化与扩展超树结构规划纲要来完成规划过程。实验证明HTP具有显著效果，在TravelPlanner基准测试中使用Gemini-1.5-Pro实现了当前最优精度，性能较o1-preview版本提升3.6倍。

Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts

Abstract

arXiv:2404.05019v3 Announce Type: replace-cross Abstract: Expert parallelism has emerged as a key strategy for distributing the computational workload of sparsely-gated mixture-of-experts (MoE) models across multiple devices, enabling the processing of increasingly large-scale models. However, the All-to-All communication inherent to expert parallelism poses a significant bottleneck, limiting the efficiency of MoE models. Although existing optimization methods partially mitigate this issue, they remain constrained by the sequential dependency between communication and computation operations. To address this challenge, we propose ScMoE, a novel shortcut-connected MoE architecture integrated with an overlapping parallelization strategy. ScMoE decouples communication from its conventional sequential ordering, enabling up to 100% overlap with computation. Compared to the prevalent top-2 MoE baseline, ScMoE achieves speedups of 1.49 times in training and 1.82 times in inference. Moreover, our experiments and analyses indicate that ScMoE achieves model quality comparable to, and in some instances surpassing, that of existing approaches.

摘要

专家并行已成为将稀疏门控专家混合模型（MoE）的计算负载分配到多设备的关键策略，使得处理日益大规模模型成为可能。然而专家并行固有的全对全通信机制构成了显著瓶颈，制约了MoE模型的效率。尽管现有优化方法能部分缓解此问题，但仍受限于通信与计算操作间的顺序依赖关系。为解决这一挑战，我们提出ScMoE——一种集成重叠并行化策略的新型快捷连接MoE架构。ScMoE将通信从传统顺序执行中解耦，最高可实现与计算100%的重叠。相较于主流top-2 MoE基线方法，ScMoE在训练和推理阶段分别实现1.49倍和1.82倍的加速。此外，实验与分析表明，ScMoE不仅达到且在某些情况下超越了现有方法的模型质量。
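
The benefit of overlapping All-to-All communication with computation can be seen from a back-of-the-envelope timing model. The sketch below is an idealized illustration, not a model of ScMoE itself: it treats one MoE layer as a single communication phase and a single computation phase, and the function names and millisecond figures are made up for the example.

```python
def sequential_time(comm_ms, comp_ms):
    """Layer time when communication must finish before computation starts."""
    return comm_ms + comp_ms


def overlapped_time(comm_ms, comp_ms, overlap_fraction=1.0):
    """Layer time when part of the shorter phase is hidden behind the longer one.

    overlap_fraction = 1.0 means the shorter phase is fully hidden, so the
    layer takes max(comm_ms, comp_ms); 0.0 reduces to the sequential case.
    """
    if not 0.0 <= overlap_fraction <= 1.0:
        raise ValueError("overlap_fraction must be between 0 and 1")
    hidden = overlap_fraction * min(comm_ms, comp_ms)
    return comm_ms + comp_ms - hidden


if __name__ == "__main__":
    comm, comp = 4.0, 6.0  # hypothetical per-layer timings in milliseconds
    base = sequential_time(comm, comp)
    full = overlapped_time(comm, comp, overlap_fraction=1.0)
    print(f"sequential: {base} ms, fully overlapped: {full} ms, "
          f"speedup: {base / full:.2f}x")
```

In this simplified model the achievable speedup is bounded by the ratio of the two phases, which is why removing the sequential dependency matters most when communication and computation take comparable time.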

ReflectionCoder: Learning from Reflection Sequence for Enhanced One-off Code Generation

Abstract

arXiv:2405.17057v2 Announce Type: replace-cross Abstract: Code generation plays a crucial role in various tasks, such as code auto-completion and mathematical reasoning. Previous work has proposed numerous methods to enhance code generation performance, including integrating feedback from the compiler. Inspired by this, we present ReflectionCoder, a novel approach that effectively leverages reflection sequences constructed by integrating compiler feedback to improve one-off code generation performance. Furthermore, we propose reflection self-distillation and dynamically masked distillation to effectively utilize these reflection sequences. Extensive experiments on three benchmarks, i.e., HumanEval (+), MBPP (+), and MultiPL-E, demonstrate that models fine-tuned with our method achieve state-of-the-art performance. Beyond the code domain, we believe this approach can benefit other domains that focus on final results and require long reasoning paths. Code and data are available at https://github.com/SenseLLM/ReflectionCoder.

摘要

代码生成在代码自动补全和数学推理等多项任务中具有关键作用。先前研究提出了多种提升代码生成性能的方法，包括整合编译器反馈机制。受此启发，我们提出ReflectionCoder——一种通过有效利用编译器反馈构建的反射序列来提升单次代码生成性能的新方法。此外，我们提出反射自蒸馏和动态掩码蒸馏技术以高效利用这些反射序列。在HumanEval (+)、MBPP (+)和MultiPL-E三个基准测试上的大量实验表明，采用本方法微调的模型实现了最先进的性能。除代码领域外，我们认为该方法同样适用于其他注重最终结果且需要长推理路径的领域。代码与数据详见https://github.com/SenseLLM/ReflectionCoder。

X-TURING: Towards an Enhanced and Efficient Turing Test for Long-Term Dialogue Agents

Abstract

arXiv:2408.09853v2 Announce Type: replace-cross Abstract: The Turing test examines whether AIs exhibit human-like behaviour in natural language conversations. The traditional setting limits each participant to one message at a time and requires constant human participation. This fails to reflect a natural conversational style and hinders the evaluation of dialogue agents based on Large Language Models (LLMs) in complex and prolonged interactions. This paper proposes X-Turing, which enhances the original test with a burst dialogue pattern, allowing more dynamic exchanges using consecutive messages. It further reduces human workload by iteratively generating dialogues that simulate the long-term interaction between the agent and a human to compose the majority of the test process. With the pseudo-dialogue history, the agent then engages in a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic to be judged using questionnaires. We introduce the X-Turn Pass-Rate metric to assess the human likeness of LLMs across varying durations. While LLMs like GPT-4 initially perform well, achieving pass rates of 51.9% and 38.9% during 3 turns and 10 turns of dialogues respectively, their performance drops as the dialogue progresses, which underscores the difficulty in maintaining consistency in the long term.

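Summary

The Turing test examines, through natural language conversation, whether an AI exhibits human-like behaviour. The traditional setup limits each participant to one message at a time and requires continuous human participation. This neither reflects a natural conversational style nor supports the evaluation of dialogue agents based on large language models (LLMs) in complex, long-term interactions. This paper proposes the X-Turing test, which improves the original test by introducing a burst dialogue pattern that lets participants exchange consecutive messages for more dynamic interaction. The method further reduces human workload by iteratively generating dialogues that simulate long-term interaction between the agent and a human, and these make up the bulk of the test process. With this pseudo-dialogue history, the agent then holds a shorter dialogue with a real human, which is paired with a human-human conversation on the same topic and judged through questionnaires. We propose the X-Turn Pass-Rate metric to assess the human likeness of LLMs over different durations. Experiments show that although LLMs such as GPT-4 perform well initially (pass rates of 51.9% and 38.9% for 3-turn and 10-turn dialogues, respectively), their performance drops as the dialogue continues, which highlights the difficulty of maintaining consistency over the long term.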

GWQ: Gradient-Aware Weight Quantization for Large Language Models

Abstract

arXiv:2411.00850v4 Announce Type: replace-cross Abstract: Large language models (LLMs) show impressive performance in solving complex language tasks. However, their large number of parameters presents significant challenges for deployment. Compressing LLMs to low bit-widths can enable deployment on resource-constrained devices. To address this problem, we propose gradient-aware weight quantization (GWQ), the first approach for low-bit weight quantization that leverages gradients to localize outliers, requiring only a minimal amount of calibration data for outlier detection. GWQ preferentially retains the top 1% of outliers at FP16 precision, while the remaining non-outlier weights are stored in a low-bit format. We extensively evaluate GWQ on different tasks, including language modeling, grounding detection, massive multitask language understanding, and vision-language question answering. Results show that models quantized by GWQ perform better than those produced by other quantization methods. During the quantization process, GWQ needs only one calibration set to achieve effective quantization. GWQ also achieves a 1.2x inference speedup in comparison to the original model and effectively reduces inference memory.

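Summary

Large language models (LLMs) deliver excellent performance on complex language tasks, but their huge parameter counts make deployment very challenging. Compressing LLMs to low bit-widths allows them to run on resource-constrained devices. To address this problem, we propose gradient-aware weight quantization (GWQ), the first low-bit weight quantization scheme that uses gradients to locate outliers, and it needs only a very small amount of calibration data for outlier detection. GWQ preferentially keeps the top 1% of outliers at FP16 precision and stores the remaining non-outlier weights at low bit-width. We evaluate GWQ comprehensively on language modeling, grounding detection, massive multitask language understanding, and vision-language question answering. The results show that models quantized with GWQ outperform those quantized with other methods. During quantization, GWQ needs only a single calibration set to quantize effectively. Compared with the original model, GWQ achieves a 1.2x inference speedup and markedly reduces inference memory.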
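The outlier-retention idea in the GWQ abstract (keep the roughly 1% of weights with the largest gradients at higher precision, store the rest at low bit-width) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function name `gwq_like_quantize`, the use of plain symmetric round-to-nearest for the low-bit part, and the use of ordinary Python floats in place of FP16. The abstract does not specify the quantizer, so this is not the authors' implementation.

```python
def gwq_like_quantize(weights, grads, outlier_frac=0.01, bits=4):
    """Illustrative mixed-precision quantization guided by gradient magnitude.

    Hypothetical helper: weights whose calibration gradients are largest are
    left untouched (standing in for FP16 storage); the rest are rounded to a
    symmetric uniform grid with 2**bits levels.
    """
    n = len(weights)
    n_out = max(1, round(outlier_frac * n))
    # Rank positions by gradient magnitude, largest first.
    order = sorted(range(n), key=lambda i: abs(grads[i]), reverse=True)
    outliers = set(order[:n_out])

    qmax = 2 ** (bits - 1) - 1
    max_abs = max((abs(weights[i]) for i in range(n) if i not in outliers),
                  default=0.0)
    scale = max_abs / qmax if max_abs > 0 else 1.0

    out = []
    for i, w in enumerate(weights):
        if i in outliers:
            out.append(w)  # retained at higher precision
        else:
            level = max(-qmax - 1, min(qmax, round(w / scale)))
            out.append(level * scale)  # snapped to the low-bit grid
    return out, outliers, scale
```

Because the scale is computed from the non-outlier weights only, a single large, gradient-sensitive weight no longer stretches the grid for everything else, which is the usual motivation for treating outliers separately.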

Personality-Guided Code Generation Using Large Language Models

Abstract

arXiv:2411.00006v2 Announce Type: replace-cross Abstract: Code generation, the automatic creation of source code from natural language descriptions, has garnered significant attention due to its potential to streamline software development. Inspired by research that links task-personality alignment with improved development outcomes, we conduct an empirical study on personality-guided code generation using large language models (LLMs). Specifically, we investigate how emulating personality traits appropriate to the coding tasks affects LLM performance. We extensively evaluate this approach using seven widely adopted LLMs across four representative datasets. Our results show that personality guidance significantly enhances code generation accuracy, with improved pass rates in 23 out of 28 LLM-dataset combinations. Notably, in 11 cases, the improvement exceeds 5%, and in 5 instances, it surpasses 10%, with the highest gain reaching 12.9%. Additionally, personality guidance can be easily integrated with other prompting strategies to further boost performance. We open-source our code and data at https://github.com/IanWalls/Persona-Code.

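Summary

Code generation, the automatic creation of source code from natural language descriptions, has attracted wide attention for its potential to simplify software development. Inspired by research showing that matching personality to task improves development outcomes, we conduct an empirical study of personality-guided code generation with large language models (LLMs). Specifically, we examine how emulating personality traits suited to a coding task affects LLM performance. A comprehensive evaluation with 7 widely used LLMs on 4 representative datasets shows that personality guidance significantly improves code generation accuracy, raising pass rates in 23 of the 28 LLM-dataset combinations. In 11 cases the improvement exceeds 5%, in 5 cases it exceeds 10%, and the largest gain is 12.9%. Personality guidance can also be combined easily with other prompting strategies to improve performance further. Our code and data are open-sourced at https://github.com/IanWalls/Persona-Code.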

SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression

Abstract

arXiv:2410.09615v3 Announce Type: replace-cross Abstract: Conventional model compression techniques for LLMs address high memory consumption and slow inference challenges but typically require computationally expensive retraining to preserve accuracy. In contrast, one-shot compression methods eliminate retraining cost, but struggle to achieve accuracy comparable to dense models. This paper presents SLIM, a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation into a unified process. First, we formulate the quantization process using a probabilistic approach (SLIM-Quant) that enables us to apply uniform quantization. Then, we use an existing one-shot pruning method to apply semi-structured sparsity on top of the quantized weights. Finally, to compensate for the introduced aggregated quantization and sparsity error, we use a novel saliency function with unique invertible and additive features that enables us to mathematically compute the value of low-rank adapters. SLIM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods. Models compressed with SLIM achieve speedups of up to 4.3x and 3.8x on NVIDIA RTX3060 and A100 GPUs, respectively. Additionally, they achieve up to 0.23x end-to-end memory reduction in comparison to their dense counterparts. We also propose an optional PEFT recipe that further improves accuracy by up to 1.66% (LLaMA-2-13B) compared to SLIM without fine-tuning.

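Summary

Conventional compression techniques for large language models address high memory consumption and slow inference, but they usually require computationally expensive retraining to preserve accuracy. One-shot compression methods, by contrast, remove the retraining cost but struggle to reach accuracy comparable to dense models. This paper presents SLIM, a new one-shot compression framework that integrates hardware-friendly quantization, sparsity, and low-rank approximation into one unified process. First, we formulate quantization with a probabilistic approach (SLIM-Quant), which makes uniform quantization possible. Next, we use an existing one-shot pruning method to impose semi-structured sparsity on the quantized weights. Finally, to compensate for the accumulated quantization and sparsity error, we design a new saliency function with invertible and additive properties that lets the values of the low-rank adapters be computed mathematically. With 2:4 sparsity and 4-bit weight quantization, SLIM improves the accuracy of LLaMA-2-7B by up to 5.66%, outperforming existing methods. Models compressed with SLIM achieve speedups of 4.3x and 3.8x on NVIDIA RTX3060 and A100 GPUs, respectively, and up to 0.23x end-to-end memory reduction relative to dense models. We also propose an optional parameter-efficient fine-tuning recipe that improves accuracy on LLaMA-2-13B by a further 1.66% over SLIM without fine-tuning.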
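The "2:4 sparsity" mentioned in the SLIM abstract is a semi-structured pattern in which at most two of every four consecutive weights are nonzero. The sketch below shows that pattern only. The abstract says SLIM relies on an existing one-shot pruning method without naming it here, so the magnitude-based selection rule and the helper name `prune_2_of_4` are assumptions made purely for illustration.

```python
def prune_2_of_4(row):
    """Zero out the two smallest-magnitude entries in each group of four.

    Hypothetical helper illustrating the 2:4 pattern with a simple
    magnitude criterion (an illustrative stand-in, not SLIM's pruner).
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = list(row)
    for start in range(0, len(row), 4):
        group = range(start, start + 4)
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(group, key=lambda i: abs(row[i]), reverse=True)[:2]
        for i in group:
            if i not in keep:
                out[i] = 0.0
    return out
```

For example, `prune_2_of_4([0.1, -0.9, 0.4, 0.05])` keeps `-0.9` and `0.4` and zeroes the other two positions, so exactly half of the weights survive in every group.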

SequentialBreak: Large Language Models Can be Fooled by Embedding Jailbreak Prompts into Sequential Prompt Chains

Abstract

arXiv:2411.06426v3 Announce Type: replace-cross Abstract: As the integration of the Large Language Models (LLMs) into various applications increases, so does their susceptibility to misuse, raising significant security concerns. Numerous jailbreak attacks have been proposed to assess the security defense of LLMs. Current jailbreak attacks mainly rely on scenario camouflage, prompt obfuscation, prompt optimization, and prompt iterative optimization to conceal malicious prompts. In particular, sequential prompt chains in a single query can lead LLMs to focus on certain prompts while ignoring others, facilitating context manipulation. This paper introduces SequentialBreak, a novel jailbreak attack that exploits this vulnerability. We discuss several scenarios, not limited to examples like Question Bank, Dialog Completion, and Game Environment, where the harmful prompt is embedded within benign ones that can fool LLMs into generating harmful responses. The distinct narrative structures of these scenarios show that SequentialBreak is flexible enough to adapt to various prompt formats beyond those discussed. Extensive experiments demonstrate that SequentialBreak uses only a single query to achieve a substantial gain of attack success rate over existing baselines against both open-source and closed-source models. Through our research, we highlight the urgent need for more robust and resilient safeguards to enhance LLM security and prevent potential misuse. All the result files and website associated with this research are available in this GitHub repository: https://anonymous.4open.science/r/JailBreakAttack-4F3B/.

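Summary

As large language models (LLMs) are integrated into more applications, the risk of their misuse rises as well, creating serious security concerns. Researchers have proposed many jailbreak attacks to assess the security defenses of LLMs. Existing attacks rely mainly on scenario camouflage, prompt obfuscation, prompt optimization, and iterative prompt optimization to conceal malicious instructions. Notably, a sequential chain of prompts in a single query can lead an LLM to focus on certain prompts and ignore others, which opens the door to context manipulation. This paper proposes SequentialBreak, a new jailbreak attack that exploits this vulnerability. We discuss several scenarios, including but not limited to Question Bank, Dialog Completion, and Game Environment, in which a harmful prompt embedded among benign ones induces the model to generate harmful responses. The distinct narrative structures of these scenarios show that SequentialBreak can adapt flexibly to prompt formats beyond those discussed here. Extensive experiments show that, against both open-source and closed-source models, SequentialBreak needs only a single query to achieve a substantial gain in attack success rate over existing baselines. This research highlights the urgent need for more robust and resilient safeguards to strengthen LLM security and prevent potential misuse. The result files and the website associated with this research are available in the repository: https://anonymous.4open.science/r/JailBreakAttack-4F3B/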

Graph of Records: Boosting Retrieval Augmented Generation for Long-context Summarization with Graphs

Abstract

arXiv:2410.11001v2 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) has revitalized Large Language Models (LLMs) by injecting non-parametric factual knowledge. Compared with long-context LLMs, RAG is considered an effective summarization tool in a more concise and lightweight manner, which can interact with LLMs multiple times using diverse queries to get comprehensive responses. However, the LLM-generated historical responses, which contain potentially insightful information, are largely neglected and discarded by existing approaches, leading to suboptimal results. In this paper, we propose graph of records (GoR), which leverages historical responses generated by LLMs to enhance RAG for long-context global summarization. Inspired by the retrieve-then-generate paradigm of RAG, we construct a graph by establishing an edge between the retrieved text chunks and the corresponding LLM-generated response. To further uncover the intricate correlations between them, GoR features a graph neural network and an elaborately designed BERTScore-based objective for self-supervised model training, enabling seamless supervision signal backpropagation between reference summaries and node embeddings. We comprehensively compare GoR with 12 baselines across four long-context summarization datasets, and the results indicate that our proposed method reaches the best performance (e.g., 15%, 8%, and 19% improvement over retrievers w.r.t. Rouge-L, Rouge-1, and Rouge-2 on the WCEP dataset). Extensive experiments further demonstrate the effectiveness of GoR.

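Summary

Retrieval-augmented generation (RAG) has given large language models (LLMs) new vitality by injecting non-parametric factual knowledge. Compared with long-context LLMs, RAG is regarded as a more concise and lightweight way to summarize effectively, since it can interact with an LLM several times using diverse queries to obtain comprehensive responses. However, existing methods mostly ignore and discard the historical responses generated by the LLM, which may contain valuable insights, and this leads to suboptimal results. This paper proposes graph of records (GoR), which uses LLM-generated historical responses to enhance RAG for long-context global summarization. Inspired by the retrieve-then-generate paradigm of RAG, we build a graph by adding an edge between each retrieved text chunk and the corresponding LLM-generated response. To uncover the complex correlations between them, GoR uses a graph neural network and a carefully designed BERTScore-based objective for self-supervised training, which allows supervision signals to backpropagate between reference summaries and node embeddings. We compare GoR with 12 baselines on four long-context summarization datasets, and the results show that the proposed method achieves the best performance (for example, improvements of 15%, 8%, and 19% over retrievers on Rouge-L, Rouge-1, and Rouge-2 on the WCEP dataset). Extensive experiments further confirm the effectiveness of GoR.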
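The graph-construction step in the GoR abstract (an edge between each retrieved text chunk and the LLM response generated from it) can be sketched as a plain adjacency structure. The function name `build_record_graph` and the `(response_id, chunk_ids)` record format are hypothetical choices for illustration; the graph neural network and the BERTScore-based training objective are not covered here.

```python
from collections import defaultdict


def build_record_graph(records):
    """Build an undirected graph linking responses to their retrieved chunks.

    `records` is an iterable of (response_id, chunk_ids) pairs, a
    hypothetical format: one pair per retrieve-then-generate round.
    """
    graph = defaultdict(set)
    for response_id, chunk_ids in records:
        for chunk_id in chunk_ids:
            graph[response_id].add(chunk_id)
            graph[chunk_id].add(response_id)
    return graph
```

A chunk retrieved for several queries ends up connected to several response nodes, which is how information from earlier responses can reach later summarization steps through shared chunks.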

SimGRAG: Leveraging Similar Subgraphs for Knowledge Graphs Driven Retrieval-Augmented Generation

Abstract

arXiv:2412.15272v2 Announce Type: replace-cross Abstract: Recent advancements in large language models (LLMs) have shown impressive versatility across various tasks. To eliminate their hallucinations, retrieval-augmented generation (RAG) has emerged as a powerful approach, leveraging external knowledge sources like knowledge graphs (KGs). In this paper, we study the task of KG-driven RAG and propose a novel Similar Graph Enhanced Retrieval-Augmented Generation (SimGRAG) method. It effectively addresses the challenge of aligning query texts and KG structures through a two-stage process: (1) query-to-pattern, which uses an LLM to transform queries into a desired graph pattern, and (2) pattern-to-subgraph, which quantifies the alignment between the pattern and candidate subgraphs using a graph semantic distance (GSD) metric. We also develop an optimized retrieval algorithm that efficiently identifies the top-k subgraphs within 1-second on a 10-million-scale KG. Extensive experiments show that SimGRAG outperforms state-of-the-art KG-driven RAG methods in both question answering and fact verification. Our code is available at https://github.com/YZ-Cai/SimGRAG.

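Summary

Recent advances in large language models (LLMs) have shown remarkable versatility across tasks. To eliminate their hallucinations, retrieval-augmented generation (RAG), which draws on external knowledge sources such as knowledge graphs (KGs), has become an effective solution. This paper studies KG-driven RAG and proposes a new method, Similar Graph Enhanced Retrieval-Augmented Generation (SimGRAG). It addresses the difficulty of aligning query text with KG structure through a two-stage process: (1) query-to-pattern, which uses an LLM to transform the query into a target graph pattern, and (2) pattern-to-subgraph, which uses a graph semantic distance (GSD) metric to quantify the alignment between the pattern and candidate subgraphs. We also develop an optimized retrieval algorithm that finds the top-k subgraphs within 1 second on a KG with 10 million entries. Extensive experiments show that SimGRAG outperforms state-of-the-art KG-driven RAG methods in both question answering and fact verification. The code is open-sourced at https://github.com/YZ-Cai/SimGRAG.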

Abstract

arXiv:2411.18212v2 Announce Type: replace-cross Abstract: Path planning under wireless performance constraints is a complex challenge in robot navigation. However, naively incorporating such constraints into classical planning algorithms often incurs prohibitive search costs. In this paper, we propose SCoTT, a wireless-aware path planning framework that leverages vision-language models (VLMs) to co-optimize average path gains and trajectory length using wireless heatmap images and ray-tracing data from a digital twin (DT). At the core of our framework is Strategic Chain-of-Thought Tasking (SCoTT), a novel prompting paradigm that decomposes the exhaustive search problem into structured subtasks, each solved via chain-of-thought prompting. To establish strong baselines, we compare classical A* and wireless-aware extensions of it, and derive DP-WA*, an optimal, iterative dynamic programming algorithm that incorporates all path gains and distance metrics from the DT, but at significant computational cost. In extensive experiments, we show that SCoTT achieves path gains within 2% of DP-WA* while consistently generating shorter trajectories. Moreover, SCoTT's intermediate outputs can be used to accelerate DP-WA* by reducing its search space, saving up to 62% in execution time. We validate our framework using four VLMs, demonstrating effectiveness across both large and small models, thus making it applicable to a wide range of compact models at low inference cost. We also show the practical viability of our approach by deploying SCoTT as a ROS node within Gazebo simulations. Finally, we discuss data acquisition pipelines, compute requirements, and deployment considerations for VLMs in 6G-enabled DTs, underscoring the potential of natural language interfaces for wireless-aware navigation in real-world applications.

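Summary

Path planning under wireless performance constraints is a complex challenge in robot navigation, yet naively adding such constraints to classical planning algorithms usually incurs prohibitive search costs. This paper proposes SCoTT, a framework that uses vision-language models (VLMs) to co-optimize average path gain and trajectory length from wireless heatmap images and ray-tracing data supplied by a digital twin (DT). At the core of the framework is Strategic Chain-of-Thought Tasking (SCoTT), a new prompting paradigm that decomposes the exhaustive search problem into structured subtasks, each solved through chain-of-thought prompting. To establish strong baselines, we compare classical A* with its wireless-aware extensions and derive DP-WA*, an optimal iterative dynamic programming algorithm that incorporates all path gains and distance metrics from the DT, but at high computational cost. Extensive experiments show that SCoTT keeps path gains within 2% of DP-WA* while consistently producing shorter trajectories. In addition, SCoTT's intermediate outputs can accelerate DP-WA* by shrinking its search space, saving up to 62% of execution time. We validate the framework with four VLMs, showing that it works for models of different sizes and can therefore be applied to a range of compact models at low inference cost. We demonstrate practical viability by deploying SCoTT as a ROS node in Gazebo simulations. Finally, we discuss data acquisition pipelines, compute requirements, and deployment considerations for VLMs in 6G-enabled digital twins, highlighting the potential of natural language interfaces for wireless-aware navigation in real-world applications.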

EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation

Abstract

arXiv:2412.12559v3 Announce Type: replace-cross Abstract: We introduce EXIT, an extractive context compression framework that enhances both the effectiveness and efficiency of retrieval-augmented generation (RAG) in question answering (QA). Current RAG systems often struggle when retrieval models fail to rank the most relevant documents, leading to the inclusion of more context at the expense of latency and accuracy. While abstractive compression methods can drastically reduce token counts, their token-by-token generation process significantly increases end-to-end latency. Conversely, existing extractive methods reduce latency but rely on independent, non-adaptive sentence selection, failing to fully utilize contextual information. EXIT addresses these limitations by classifying sentences from retrieved documents - while preserving their contextual dependencies - enabling parallelizable, context-aware extraction that adapts to query complexity and retrieval quality. Our evaluations on both single-hop and multi-hop QA tasks show that EXIT consistently surpasses existing compression methods and even uncompressed baselines in QA accuracy, while also delivering substantial reductions in inference time and token count. By improving both effectiveness and efficiency, EXIT provides a promising direction for developing scalable, high-quality QA solutions in RAG pipelines. Our code is available at https://github.com/ThisIsHwang/EXIT

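Summary

We present EXIT, an extractive context compression framework that improves both the effectiveness and the efficiency of retrieval-augmented generation (RAG) in question answering (QA). Current RAG systems often struggle when the retrieval model fails to rank the most relevant documents correctly, so more context has to be included at the expense of latency and accuracy. Abstractive compression methods can cut token counts sharply, but their token-by-token generation adds considerable end-to-end latency. Existing extractive methods reduce latency but rely on independent, non-adaptive sentence selection and do not make full use of contextual information. EXIT classifies sentences from the retrieved documents while preserving their contextual dependencies, which enables parallelizable, context-aware extraction that adapts to query complexity and retrieval quality. Evaluations on single-hop and multi-hop QA tasks show that EXIT consistently exceeds existing compression methods, and even uncompressed baselines, in QA accuracy while substantially reducing inference time and token count. By improving effectiveness and efficiency together, EXIT offers a promising direction for scalable, high-quality QA solutions in RAG pipelines. The code is open-sourced at https://github.com/ThisIsHwang/EXIT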
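The extractive step described in the EXIT abstract (score each retrieved sentence with access to its surrounding context, then keep the relevant ones) can be sketched as follows. The scorer here, `toy_relevance`, is a hypothetical word-overlap stand-in that ignores the context argument; EXIT itself uses a learned classifier, which this sketch does not reproduce. The threshold value is likewise an assumption.

```python
def toy_relevance(query, sentence, context):
    """Hypothetical stand-in scorer: fraction of query words found in the sentence.

    The `context` argument is accepted (a real classifier would use it)
    but this toy version ignores it.
    """
    query_words = set(query.lower().split())
    sentence_words = set(sentence.lower().split())
    return len(query_words & sentence_words) / len(query_words) if query_words else 0.0


def extract_sentences(query, sentences, scorer=toy_relevance, threshold=0.5):
    """Keep sentences whose score meets the threshold, in original order."""
    context = " ".join(sentences)
    # Each sentence is scored independently against the same context,
    # so these calls could be batched or run in parallel.
    scores = [scorer(query, s, context) for s in sentences]
    return [s for s, score in zip(sentences, scores) if score >= threshold]
```

Because no text is generated, the cost is one classification per sentence rather than one decoding step per output token, which is the latency argument the abstract makes against abstractive compression.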

Chain of Grounded Objectives: Bridging Process and Goal-oriented Prompting for Code Generation

Abstract

arXiv:2501.13978v2 Announce Type: replace-cross Abstract: The use of Large Language Models (LLMs) for code generation has gained significant attention in recent years. Existing methods often aim to improve the quality of generated code by incorporating additional contextual information or guidance into input prompts. Many of these approaches adopt sequential reasoning strategies, mimicking human-like step-by-step thinking. However, such strategies may constrain flexibility, as they do not always align with the structured characteristics of programming languages. This paper introduces the Chain of Grounded Objectives (CGO), a method that embeds functional objectives into input prompts to enhance code generation. By leveraging appropriately structured objectives as input and avoiding explicit sequential procedures, CGO adapts effectively to the structured nature of programming tasks. Empirical evaluations demonstrate that CGO effectively enhances code generation, addressing limitations of existing approaches.

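Summary

In recent years, the use of large language models (LLMs) for code generation has attracted wide attention. Existing methods usually try to improve the quality of generated code by adding extra contextual information or guidance to the input prompt, and many of them adopt sequential reasoning strategies that mimic human step-by-step thinking. Such strategies can limit flexibility, however, because they do not always match the structured characteristics of programming languages. This paper proposes Chain of Grounded Objectives (CGO), a method that embeds functional objectives in the input prompt to enhance code generation. By taking appropriately structured objectives as input and avoiding explicit sequential procedures, CGO adapts effectively to the structured nature of programming tasks. Empirical evaluation shows that CGO improves code generation quality and addresses limitations of existing approaches.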

A Reality Check on Context Utilisation for Retrieval-Augmented Generation

Abstract

arXiv:2412.17031v2 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) helps address the limitations of parametric knowledge embedded within a language model (LM). In real-world settings, retrieved information can vary in complexity, yet most investigations of LM utilisation of context have been limited to synthetic text. We introduce DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts) with real-world queries and contexts manually annotated for stance. The dataset is based on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. We compare DRUID to synthetic datasets (CounterFact, ConflictQA) and find that artificial datasets often fail to represent the complexity and diversity of realistically retrieved context. We show that synthetic datasets exaggerate context characteristics rare in real retrieved data, which leads to inflated context utilisation results, as measured by our novel ACU score. Moreover, while previous work has mainly focused on singleton context characteristics to explain context utilisation, correlations between singleton context properties and ACU on DRUID are surprisingly small compared to other properties related to context source. Overall, our work underscores the need for real-world aligned context utilisation studies to represent and improve performance in real-world RAG settings.

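Summary

Retrieval-augmented generation (RAG) helps address the inherent limitations of the parametric knowledge in a language model (LM). In real-world settings the complexity of retrieved information varies considerably, yet existing studies of how LMs use context are mostly limited to synthetic text. This work introduces DRUID (Dataset of Retrieved Unreliable, Insufficient and Difficult-to-understand contexts), which contains real-world queries and contexts manually annotated for stance. The dataset is built on the prototypical task of automated claim verification, for which automated retrieval of real-world evidence is crucial. Comparing DRUID with synthetic datasets (CounterFact, ConflictQA), we find that artificial datasets often fail to capture the complexity and diversity of realistically retrieved context. Synthetic datasets exaggerate context characteristics that are rare in real retrieved data, which inflates context utilisation results as measured by our new ACU score. Moreover, although previous work has focused mainly on singleton context characteristics to explain context utilisation, on DRUID the correlations between singleton context properties and ACU are surprisingly weak compared with other properties related to the context source. Overall, this work underscores the need for context utilisation studies aligned with the real world in order to represent and improve performance in real-world RAG settings.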

Beyond Reward Hacking: Causal Rewards for Large Language Model Alignment

Abstract

arXiv:2501.09620v2 Announce Type: replace-cross Abstract: Recent advances in large language models (LLMs) have demonstrated significant progress in performing complex tasks. While Reinforcement Learning from Human Feedback (RLHF) has been effective in aligning LLMs with human preferences, it is susceptible to spurious correlations in reward modeling. Consequently, it often introduces biases, such as length bias, sycophancy, conceptual bias, and discrimination, that hinder the model's ability to capture true causal relationships. To address this, we propose a novel causal reward modeling approach that integrates causality to mitigate these spurious correlations. Our method enforces counterfactual invariance, ensuring reward predictions remain consistent when irrelevant variables are altered. Through experiments on both synthetic and real-world datasets, we show that our approach mitigates various types of spurious correlations effectively, resulting in more reliable and fair alignment of LLMs with human preferences. As a drop-in enhancement to the existing RLHF workflow, our causal reward modeling provides a practical way to improve the trustworthiness and fairness of LLM finetuning.

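Summary

Recent advances in large language models (LLMs) have shown significant progress on complex tasks. Although reinforcement learning from human feedback (RLHF) aligns models with human preferences effectively, its reward modeling is susceptible to spurious correlations and often introduces length bias, sycophancy, conceptual bias, and discrimination, which hinder the model from capturing true causal relationships. To address this, we propose a new causal reward modeling approach that integrates causality to remove these spurious correlations. The method enforces counterfactual invariance, ensuring that reward predictions stay stable when irrelevant variables change. Experiments on synthetic and real-world datasets show that the approach effectively mitigates several kinds of spurious correlation, making the alignment of LLMs with human preferences more reliable and fair. As a drop-in enhancement to the existing RLHF workflow, causal reward modeling offers a practical way to improve the trustworthiness and fairness of LLM fine-tuning.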

Human-Readable Adversarial Prompts: An Investigation into LLM Vulnerabilities Using Situational Context

Abstract

arXiv:2412.16359v3 Announce Type: replace-cross Abstract: As the AI systems become deeply embedded in social media platforms, we've uncovered a concerning security vulnerability that goes beyond traditional adversarial attacks. It becomes important to assess the risks of LLMs before the general public use them on social media platforms to avoid any adverse impacts. Unlike obvious nonsensical text strings that safety systems can easily catch, our work reveals that human-readable situation-driven adversarial full-prompts that leverage situational context are effective but much harder to detect. We found that skilled attackers can exploit the vulnerabilities in open-source and proprietary LLMs to make a malicious user query safe for LLMs, resulting in generating a harmful response. This raises an important question about the vulnerabilities of LLMs. To measure the robustness against human-readable attacks, which now present a potent threat, our research makes three major contributions. First, we developed attacks that use movie scripts as situational contextual frameworks, creating natural-looking full-prompts that trick LLMs into generating harmful content. Second, we developed a method to transform gibberish adversarial text into readable, innocuous content that still exploits vulnerabilities when used within the full-prompts. Finally, we enhanced the AdvPrompter framework with p-nucleus sampling to generate diverse human-readable adversarial texts that significantly improve attack effectiveness against models like GPT-3.5-Turbo-0125 and Gemma-7b. Our findings show that these systems can be manipulated to operate beyond their intended ethical boundaries when presented with seemingly normal prompts that contain hidden adversarial elements. By identifying these vulnerabilities, we aim to drive the development of more robust safety mechanisms that can withstand sophisticated attacks in real-world applications.

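Summary

As AI systems become deeply embedded in social media platforms, we have uncovered a security vulnerability that goes beyond traditional adversarial attacks. It is important to assess the risks of large language models before the general public uses them widely on social media platforms, in order to avoid adverse impacts. Unlike obviously nonsensical text that safety systems catch easily, our work shows that human-readable, situation-driven adversarial full-prompts that exploit situational context are both effective and hard to detect. We find that skilled attackers can exploit vulnerabilities in open-source and proprietary large language models to make a malicious user query appear safe to the model, which then generates a harmful response. This raises an important question about the vulnerabilities of large language models. To measure robustness against human-readable attacks, which now pose a potent threat, this research makes three main contributions. First, we developed attacks that use movie scripts as situational frameworks, building natural-looking full-prompts that induce models to generate harmful content. Second, we developed a method for transforming nonsensical adversarial text into readable, innocuous content that still exploits vulnerabilities when used within the full-prompts. Finally, we enhanced the AdvPrompter framework with p-nucleus sampling to generate diverse human-readable adversarial texts, which significantly improves attack effectiveness against models such as GPT-3.5-Turbo-0125 and Gemma-7b. The findings show that these systems can be manipulated into operating beyond their intended ethical boundaries when given seemingly normal prompts that contain hidden adversarial elements. By identifying these vulnerabilities, we aim to drive the development of more robust safety mechanisms that can withstand sophisticated attacks in real-world applications.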

VideoRAG: Retrieval-Augmented Generation over Video Corpus

Abstract

arXiv:2501.05874v3 Announce Type: replace-cross Abstract: Retrieval-Augmented Generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing approaches primarily focus on text, with some recent advancements considering images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing contextual details more effectively than any other modality. While very recent studies explore the use of videos in response generation, they either predefine query-associated videos without retrieval or convert videos into textual descriptions, losing multimodal richness. To tackle these issues, we introduce VideoRAG, a framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both visual and textual information. The operation of VideoRAG is powered by recent Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of retrieved videos jointly with queries for response generation. In addition, because the context size of LVLMs may not be sufficient to process all frames in extremely long videos and not all frames are equally important, we introduce a video frame selection mechanism to extract the most informative subset of frames, along with a strategy to extract textual information from videos (as it can aid the understanding of video content) when their subtitles are not available. We experimentally validate the effectiveness of VideoRAG, showing that it is superior to relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.

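Summary

Retrieval-augmented generation (RAG) is a powerful strategy for improving the factual accuracy of models by retrieving external knowledge relevant to a query and incorporating it into generation. Existing methods focus mainly on text, and although some recent work considers images, most overlook video, a rich source of multimodal knowledge that represents contextual detail far better than other modalities. Recent studies have begun to explore video in response generation, but they either predefine the videos associated with a query without retrieving them or convert videos into textual descriptions and lose multimodal richness. To address these problems, we propose VideoRAG, a framework that dynamically retrieves videos according to their relevance to the query and uses both visual and textual information. VideoRAG is powered by recent Large Video Language Models (LVLMs), which process video content directly to represent it for retrieval and integrate retrieved videos with queries to generate responses. In addition, because the context length of LVLMs may be too short to process every frame of an extremely long video, and because not all frames are equally important, we introduce a video frame selection mechanism that extracts the most informative subset of frames, together with a strategy for extracting textual information from videos when subtitles are missing, since such text aids understanding of the video content. Experiments validate the effectiveness of VideoRAG, which outperforms relevant baselines. Code is available at https://github.com/starsuzi/VideoRAG.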

Fast Large Language Model Collaborative Decoding via Speculation

Abstract

arXiv:2502.01662v2 Announce Type: replace-cross Abstract: Large Language Model (LLM) collaborative decoding techniques improve output quality by combining the outputs of multiple models at each generation step, but they incur high computational costs. In this paper, we introduce Collaborative decoding via Speculation (CoS), a novel framework that accelerates collaborative decoding without compromising performance. Inspired by Speculative Decoding, in which a small proposal model generates tokens sequentially and a larger target model verifies them in parallel, our approach builds on two key insights: (1) the verification distribution can be the combined distribution of both the proposal and target models, and (2) alternating each model as the proposer and verifier can further enhance efficiency. We generalize this method to collaboration among n models and theoretically prove that CoS is never slower than standard collaborative decoding, typically achieving faster speed. Extensive experiments demonstrate CoS is 1.11x-2.23x faster than standard collaborative decoding without compromising generation quality. Our code is available at https://github.com/Kamichanw/CoS/.

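Summary

Collaborative decoding techniques for large language models (LLMs) improve output quality by combining the outputs of several models at each generation step, but they carry high computational costs. This paper proposes Collaborative decoding via Speculation (CoS), a new framework that accelerates collaborative decoding without sacrificing performance. Inspired by speculative decoding, in which a small proposal model generates tokens sequentially and a larger target model verifies them in parallel, our approach rests on two key insights: (1) the verification distribution can be the combined distribution of the proposal and target models, and (2) alternating the roles of proposer and verifier among the models can improve efficiency further. We generalize the method to collaboration among n models and prove theoretically that CoS is never slower than standard collaborative decoding and is usually faster. Extensive experiments show that CoS is 1.11x to 2.23x faster than standard collaborative decoding without reducing generation quality. The code is open-sourced at https://github.com/Kamichanw/CoS/.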
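The first insight in the CoS abstract, verifying a proposed token against a combined distribution rather than the target model alone, can be illustrated with the standard speculative-sampling acceptance rule, in which a proposed token x is accepted with probability min(1, p_verify(x) / p_proposal(x)). The sketch below is only that: the weighted average used in `combine` is an assumption, since the abstract does not say how the two distributions are combined, and both function names are hypothetical.

```python
def combine(proposal, target, weight=0.5):
    """Hypothetical combination rule: a weighted average of two distributions.

    Distributions are dicts mapping token -> probability.
    """
    tokens = set(proposal) | set(target)
    return {
        t: weight * proposal.get(t, 0.0) + (1.0 - weight) * target.get(t, 0.0)
        for t in tokens
    }


def acceptance_prob(token, proposal, verify):
    """Standard speculative-sampling acceptance probability for one token.

    Assumes proposal[token] > 0, which holds for any token the proposal
    model actually sampled.
    """
    return min(1.0, verify.get(token, 0.0) / proposal[token])
```

With `proposal = {"a": 0.8, "b": 0.2}` and `target = {"a": 0.4, "b": 0.6}`, the averaged distribution assigns about 0.6 to "a", so a proposed "a" is accepted with probability about 0.75, while a proposed "b" is always accepted because the combined distribution gives it more mass than the proposal did.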

Tensor Product Attention Is All You Need

Abstract

arXiv:2501.06425v4 Announce Type: replace-cross Abstract: Scaling language models to handle longer input sequences typically necessitates large key-value (KV) caches, resulting in substantial memory overhead during inference. In this paper, we propose Tensor Product Attention (TPA), a novel attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, substantially shrinking the KV cache size at inference time. By factorizing these representations into contextual low-rank components and seamlessly integrating with Rotary Position Embedding (RoPE), TPA achieves improved model quality alongside memory efficiency. Based on TPA, we introduce the Tensor Product Attention Transformer (T6), a new model architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we demonstrate that T6 surpasses or matches the performance of standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA) across various metrics, including perplexity and a range of established evaluation benchmarks. Notably, TPA's memory efficiency and computational efficiency at the decoding stage enable processing longer sequences under fixed resource constraints, addressing a critical scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.

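Summary

Scaling language models to handle longer input sequences usually requires large key-value (KV) caches, which create substantial memory overhead during inference. This paper proposes Tensor Product Attention (TPA), a new attention mechanism that uses tensor decompositions to represent queries, keys, and values compactly, greatly shrinking the KV cache at inference time. By factorizing these representations into contextual low-rank components and integrating seamlessly with Rotary Position Embedding (RoPE), TPA improves model quality while remaining memory-efficient. Building on TPA, we introduce the Tensor Product Attention Transformer (T6), a new architecture for sequence modeling. Through extensive empirical evaluation on language modeling tasks, we show that T6 surpasses or matches standard Transformer baselines, including Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA), on perplexity and a range of established evaluation benchmarks. Notably, TPA's memory and computational efficiency at the decoding stage allow longer sequences to be processed under fixed resource limits, addressing a key scalability challenge in modern language models. The code is available at https://github.com/tensorgi/T6.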
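The memory argument in the TPA abstract can be made concrete with a small sketch. Assume, for illustration only, that a token's key matrix of shape (heads x head_dim) is represented as a sum of `rank` outer products of a head-side factor and a feature-side factor, and that the cache stores the factors instead of the full matrix. The exact factorization, any scaling, and the RoPE integration used in the paper are not reproduced here, and both function names are hypothetical.

```python
def outer_sum(a_factors, b_factors):
    """Rebuild a (heads x dim) matrix as a sum of outer products a_r (x) b_r.

    a_factors: list of `rank` vectors of length heads.
    b_factors: list of `rank` vectors of length dim.
    A plain sum is used here; any normalization is an open assumption.
    """
    heads, dim = len(a_factors[0]), len(b_factors[0])
    out = [[0.0] * dim for _ in range(heads)]
    for a, b in zip(a_factors, b_factors):
        for i in range(heads):
            for j in range(dim):
                out[i][j] += a[i] * b[j]
    return out


def cache_entries(heads, dim, rank):
    """Per-token cache size: (full matrix, factorized form), in scalars."""
    return heads * dim, rank * (heads + dim)
```

Under these assumptions, with 32 heads of dimension 128 and rank 2, `cache_entries(32, 128, 2)` returns `(4096, 320)`: the factors take 320 scalars per token instead of 4096 for the full key matrix, which illustrates why low-rank factors can shrink the KV cache.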

MedRAX: Medical Reasoning Agent for Chest X-ray

Abstract

arXiv:2502.02673v2 Announce Type: replace-cross Abstract: Chest X-rays (CXRs) play an integral role in driving critical decisions in disease management and patient care. While recent innovations have led to specialized models for various CXR interpretation tasks, these solutions often operate in isolation, limiting their practical utility in clinical practice. We present MedRAX, the first versatile AI agent that seamlessly integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX dynamically leverages these models to address complex medical queries without requiring additional training. To rigorously evaluate its capabilities, we introduce ChestAgentBench, a comprehensive benchmark containing 2,500 complex medical queries across 7 diverse categories. Our experiments demonstrate that MedRAX achieves state-of-the-art performance compared to both open-source and proprietary models, representing a significant step toward the practical deployment of automated CXR interpretation systems. Data and code are publicly available at https://github.com/bowang-lab/MedRAX

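Summary

Chest X-rays (CXRs) play an indispensable role in critical decisions about disease management and patient care. Although recent innovations have produced specialized models for various CXR interpretation tasks, these solutions often operate in isolation, which limits their practical usefulness in clinical practice. We present MedRAX, the first versatile AI agent that integrates state-of-the-art CXR analysis tools and multimodal large language models into a unified framework. MedRAX calls on these models dynamically to handle complex medical queries without additional training. To evaluate its capabilities rigorously, we built ChestAgentBench, a comprehensive benchmark of 2,500 complex medical queries across 7 categories. Experiments show that MedRAX achieves state-of-the-art performance compared with both open-source and proprietary models, marking a significant step toward practical deployment of automated CXR interpretation systems. Data and code are publicly available at https://github.com/bowang-lab/MedRAX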

Position: Scaling LLM Agents Requires Asymptotic Analysis with LLM Primitives

Abstract

arXiv:2502.04358v2 Announce Type: replace-cross Abstract: Decomposing hard problems into subproblems often makes them easier and more efficient to solve. With large language models (LLMs) crossing critical reliability thresholds for a growing slate of capabilities, there is an increasing effort to decompose systems into sets of LLM-based agents, each of whom can be delegated sub-tasks. However, this decomposition (even when automated) is often intuitive, e.g., based on how a human might assign roles to members of a human team. How close are these role decompositions to optimal? This position paper argues that asymptotic analysis with LLM primitives is needed to reason about the efficiency of such decomposed systems, and that insights from such analysis will unlock opportunities for scaling them. By treating the LLM forward pass as the atomic unit of computational cost, one can separate out the (often opaque) inner workings of a particular LLM from the inherent efficiency of how a set of LLMs are orchestrated to solve hard problems. In other words, if we want to scale the deployment of LLMs to the limit, instead of anthropomorphizing LLMs, asymptotic analysis with LLM primitives should be used to reason about and develop more powerful decompositions of large problems into LLM agents.

摘要

将复杂问题分解为子问题通常能使其更易解决且效率更高。随着大语言模型（LLM）在越来越多能力上跨越关键可靠性阈值，研究者正致力于将系统分解为基于LLM的智能体集合，每个智能体可被分配子任务。然而，这种分解（即使是自动化实现的）往往基于人类直觉，例如模仿人类团队的角色分配方式。这类角色分解距离最优解有多远？本立场论文提出：需要采用基于LLM原语的渐近分析来论证此类分解系统的效率，而由此获得的洞见将释放其规模化应用的潜力。通过将LLM前向传播作为计算成本的原子单元，我们可以将特定LLM（通常不透明）的内部运作机制与多LLM协同解决复杂问题的固有效率分离开来。换言之，若要将LLM部署规模推向极限，不应将LLM拟人化，而应运用基于LLM原语的渐近分析来论证问题分解方案，从而开发出更强大的LLM智能体分解方法。

SPRI: Aligning Large Language Models with Context-Situated Principles

Abstract

arXiv:2502.03397v2 Announce Type: replace-cross Abstract: Aligning Large Language Models to integrate and reflect human values, especially for tasks that demand intricate human oversight, is arduous since it is resource-intensive and time-consuming to depend on human expertise for context-specific guidance. Prior work has utilized predefined sets of rules or principles to steer the behavior of models (Bai et al., 2022; Sun et al., 2023). However, these principles tend to be generic, making it challenging to adapt them to each individual input query or context. In this work, we present Situated-PRInciples (SPRI), a framework requiring minimal or no human effort that is designed to automatically generate guiding principles in real-time for each input query and utilize them to align each response. We evaluate SPRI on three tasks, and show that 1) SPRI can derive principles in a complex domain-specific task that leads to on-par performance as expert-crafted ones; 2) SPRI-generated principles lead to instance-specific rubrics that outperform prior LLM-as-a-judge frameworks; 3) using SPRI to generate synthetic SFT data leads to substantial improvement on truthfulness. We release our code and model generations at https://github.com/honglizhan/SPRI-public.

摘要

使大语言模型与人类价值观对齐并体现这些价值观，尤其是需要复杂人工监督的任务，具有相当难度，因为依赖人类专家提供针对具体情境的指导既耗费资源又耗时。先前的研究采用预定义的规则或原则来引导模型行为（Bai等人，2022；Sun等人，2023），但这些原则往往过于通用，难以适应每个输入查询或具体情境。本研究提出情境化原则框架（SPRI），该框架几乎无需人工干预，能够实时为每个输入查询自动生成指导原则，并利用这些原则对每个响应进行对齐。我们在三项任务上评估SPRI，结果表明：1）SPRI能在复杂领域特定任务中生成与专家手工编写原则效果相当的原则；2）SPRI生成的原则所产生的实例特定评估标准优于现有LLM-as-a-judge框架；3）使用SPRI生成合成监督微调数据可显著提升真实性。代码及模型生成内容发布于https://github.com/honglizhan/SPRI-public。

Jailbreaking to Jailbreak

Abstract

arXiv:2502.09638v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) can be used to red team other models (e.g., by jailbreaking them) to elicit harmful content. Prior work commonly employs open-weight models or private uncensored models for jailbreaking, because refusal-trained strong LLMs (e.g., OpenAI o3) refuse to help with it; our work turns (almost) any black-box LLM into an attacker. The resulting $J_2$ (jailbreaking-to-jailbreak) attackers can effectively jailbreak the safeguards of target models using various strategies, either created by themselves or taken from expert human red teamers. In doing so, we show their strong but under-researched jailbreaking capabilities. Our experiments demonstrate that 1) prompts used to create $J_2$ attackers transfer across almost all black-box models; 2) a $J_2$ attacker can jailbreak a copy of itself, and this vulnerability has developed rapidly over the past 12 months; 3) reasoning models, such as Sonnet-3.7, are strong $J_2$ attackers compared to others. For example, when used against the safeguard of GPT-4o, $J_2$ (Sonnet-3.7) achieves a 0.975 attack success rate (ASR), which matches expert human red teamers and surpasses state-of-the-art algorithm-based attacks. Among $J_2$ attackers, $J_2$ (o3) achieves the highest ASR (0.605) against Sonnet-3.5, one of the most robust models.

CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

Abstract

arXiv:2502.04350v2 Announce Type: replace-cross Abstract: Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark, SymBench, comprising 37 symbolic tasks with adjustable complexity, and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the best existing LLMs, OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8), across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, datasets, and code are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98.

摘要

现有方法未能有效引导大语言模型（LLM）在文本推理与代码生成之间进行切换，导致符号计算能力未被充分利用。我们提出CodeSteer——一种指导LLM代码/文本生成的有效方法。通过构建包含37项可调复杂度符号任务的综合基准SymBench，并合成12k轮多轮指导/生成轨迹数据集与5.5k组指导对比对，我们采用新设计的多轮监督微调（SFT）和直接偏好优化（DPO）对Llama-3-8B模型进行微调。所得模型CodeSteerLLM结合提出的符号检查器与自应答校验器，能有效指导更大模型的代码/文本生成。将CodeSteer应用于GPT-4o后，其平均性能得分从53.3提升至86.4，在全部37项任务（28项已知、9项未知）中均超越现有最佳LLM OpenAI o1（82.7）、o1-preview（74.8）和DeepSeek R1（76.8）。针对GPT-4o训练的CodeSteer展现出卓越泛化能力，在Claude、Mistral和GPT-3.5上平均带来41.8的性能提升。经CodeSteer指导的LLM能充分运用符号计算，在高度复杂任务中保持强劲性能。

Multilingual Encoder Knows more than You Realize: Shared Weights Pretraining for Extremely Low-Resource Languages

Abstract

arXiv:2502.10852v2 Announce Type: replace-cross Abstract: While multilingual language models like XLM-R have advanced multilingualism in NLP, they still perform poorly in extremely low-resource languages. This situation is exacerbated by the fact that modern LLMs such as LLaMA and Qwen support far fewer languages than XLM-R, making text generation models non-existent for many languages in the world. To tackle this challenge, we propose a novel framework for adapting multilingual encoders to text generation in extremely low-resource languages. By reusing the weights between the encoder and the decoder, our framework allows the model to leverage the learned semantic space of the encoder, enabling efficient learning and effective generalization in low-resource languages. Applying this framework to four Chinese minority languages, we present XLM-SWCM, and demonstrate its superior performance on various downstream tasks even when compared with much larger models.

摘要

虽然XLM-R等多语言模型推动了NLP领域的多语言发展，但在极低资源语言中仍表现欠佳。现代大语言模型（如LLaMA和Qwen）支持的语言数量远少于XLM-R，这使得全球许多语言缺乏文本生成模型，这一现状进一步加剧了问题的严重性。为应对这一挑战，我们提出了一种新颖框架，可将多语言编码器适配于极低资源语言的文本生成任务。通过复用编码器与解码器间的权重，该框架使模型能够利用编码器已学习的语义空间，从而实现低资源语言下的高效学习与有效泛化。我们将该框架应用于四种中国少数民族语言，构建了XLM-SWCM模型，实验表明即使与规模更大的模型相比，其在多项下游任务中仍展现出卓越性能。

Learning to Reason from Feedback at Test-Time

Abstract

arXiv:2502.15771v2 Announce Type: replace-cross Abstract: Solving complex tasks in a single attempt is challenging for large language models (LLMs). Iterative interaction with the environment and feedback is often required to achieve success, making effective feedback utilization a critical topic. Existing approaches either struggle with length generalization or rely on naive retries without leveraging prior information. In this paper, we introduce FTTT, a novel paradigm that formulates feedback utilization as an optimization problem at test time. Additionally, we propose a learnable test-time optimizer, OpTune, to effectively exploit feedback. Experiments on two LLMs across four reasoning datasets demonstrate that FTTT and OpTune achieve superior scalability and performance.

摘要

解决复杂任务对大型语言模型（LLMs）而言单次尝试具有挑战性。通常需要与环境进行迭代交互并利用反馈才能取得成功，这使得有效利用反馈成为关键议题。现有方法要么难以实现长度泛化，要么仅依赖简单重试而未能利用先验信息。本文提出FTTT这一新颖范式，将反馈利用形式化为测试时的优化问题。此外，我们设计了一种可学习的测试时优化器OpTune以实现高效反馈利用。在两个LLM模型和四个推理数据集上的实验表明，FTTT与OpTune具有卓越的扩展性和性能表现。

STeCa: Step-level Trajectory Calibration for LLM Agent Learning

Abstract

arXiv:2502.14276v2 Announce Type: replace-cross Abstract: Large language model (LLM)-based agents have shown promise in tackling complex tasks by interacting dynamically with the environment. Existing work primarily focuses on behavior cloning from expert demonstrations or preference learning through exploratory trajectory sampling. However, these methods often struggle to address long-horizon tasks, where suboptimal actions accumulate step by step, causing agents to deviate from correct task trajectories. To address this, we highlight the importance of timely calibration and the need to automatically construct calibration trajectories for training agents. We propose Step-Level Trajectory Calibration (STeCa), a novel framework for LLM agent learning. Specifically, STeCa identifies suboptimal actions through a step-level reward comparison during exploration. It constructs calibrated trajectories using LLM-driven reflection, enabling agents to learn from improved decision-making processes. We finally leverage these calibrated trajectories with successful trajectories for reinforced training. Extensive experiments demonstrate that STeCa significantly outperforms existing methods. Further analysis highlights that timely calibration enables agents to complete tasks with greater robustness. Our code and data are available at https://github.com/WangHanLinHenry/STeCa.

摘要

基于大语言模型（LLM）的智能体通过与环境动态交互，在解决复杂任务方面展现出潜力。现有研究主要集中于通过专家演示进行行为克隆，或通过探索性轨迹采样实现偏好学习。然而这些方法往往难以处理长周期任务——次优行动会逐步累积，导致智能体偏离正确任务轨迹。为此，我们强调及时校准的重要性，并提出需要自动构建校准轨迹来训练智能体。本文提出步级轨迹校准框架STeCa，这是一种新型LLM智能体学习框架：首先在探索阶段通过步级奖励比较识别次优行动；继而利用LLM驱动的反思构建校准轨迹，使智能体能从改进的决策过程中学习；最终将这些校准轨迹与成功轨迹共同用于强化训练。大量实验表明STeCa显著优于现有方法。进一步分析证实，及时校准能使智能体以更强鲁棒性完成任务。代码与数据详见https://github.com/WangHanLinHenry/STeCa。

Enhancing Semi-supervised Learning with Zero-shot Pseudolabels

Abstract

arXiv:2502.12584v2 Announce Type: replace-cross Abstract: The high cost of data labeling presents a major barrier to deploying machine learning systems at scale. Semi-supervised learning (SSL) mitigates this challenge by utilizing unlabeled data alongside limited labeled examples, while the emergence of foundation models (FMs) offers powerful zero-shot capabilities that can further reduce labeling cost. However, directly fine-tuning large FMs is often impractical in resource-constrained settings, and naively using their pseudo-labels for unlabeled data can degrade performance due to their unreliability or domain mismatch with the target task. In this work, we introduce ZeroMatch, a novel SSL framework that integrates knowledge distillation with consistency-based learning to jointly leverage labeled data, unlabeled data, and pseudo-labels from FMs. ZeroMatch enables training compact student models using only FM inference, making it suitable for low-resource environments such as personal devices with limited compute. Experiments on six vision and language classification benchmarks show that ZeroMatch consistently outperforms standard SSL and zero-shot augmented methods, demonstrating its effectiveness and robustness across a range of foundation model qualities.

摘要

数据标注的高成本是机器学习系统大规模部署的主要障碍。半监督学习（SSL）通过利用未标注数据与有限标注样本相结合的方式缓解了这一挑战，而基础模型（FM）的出现进一步提供了可降低标注成本的强大零样本能力。然而，在资源受限场景中直接微调大型基础模型往往不切实际，且因其伪标签的不可靠性或与目标任务存在领域失配，直接使用这些伪标注可能导致性能下降。本研究提出ZeroMatch——一种将知识蒸馏与基于一致性的学习相融合的新型半监督学习框架，能够协同利用标注数据、未标注数据及基础模型生成的伪标签。该框架仅需通过基础模型推理即可训练紧凑的学生模型，适用于个人设备等计算资源受限的低资源环境。在六个视觉与语言分类基准测试上的实验表明，ZeroMatch在各类基础模型质量条件下均稳定优于标准半监督学习及零样本增强方法，验证了其有效性与鲁棒性。

DELMAN: Dynamic Defense Against Large Language Model Jailbreaking with Model Editing

Abstract

arXiv:2502.11647v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are widely applied in decision making, but their deployment is threatened by jailbreak attacks, where adversarial users manipulate model behavior to bypass safety measures. Existing defense mechanisms, such as safety fine-tuning and model editing, either require extensive parameter modifications or lack precision, leading to performance degradation on general tasks and making them unsuitable for post-deployment safety alignment. To address these challenges, we propose DELMAN (Dynamic Editing for LLMs JAilbreak DefeNse), a novel approach leveraging direct model editing for precise, dynamic protection against jailbreak attacks. DELMAN directly updates a minimal set of relevant parameters to neutralize harmful behaviors while preserving the model's utility. To avoid triggering a safe response in benign contexts, we incorporate KL-divergence regularization to ensure the updated model remains consistent with the original model when processing benign queries. Experimental results demonstrate that DELMAN outperforms baseline methods in mitigating jailbreak attacks while preserving the model's utility, and adapts seamlessly to new attack instances, providing a practical and efficient solution for post-deployment model protection.

摘要

大型语言模型（LLMs）在决策制定中被广泛应用，但其部署面临越狱攻击的威胁，即对抗性用户通过操纵模型行为绕过安全防护措施。现有防御机制（如安全微调和模型编辑）要么需要大量参数修改，要么缺乏精确性，导致通用任务性能下降，因此不适用于部署后的安全对齐。为解决这些问题，我们提出DELMAN（动态编辑式LLMs越狱防御），这是一种利用直接模型编辑实现精准动态防护的新方法。DELMAN通过直接更新最小相关参数集来消除有害行为，同时保持模型功能。为避免在良性语境下触发安全响应，我们引入KL散度正则化机制，确保更新后的模型在处理良性查询时与原始模型保持一致。实验结果表明，DELMAN在抵御越狱攻击的同时能更好地保持模型功能，且能无缝适应新型攻击实例，为部署后模型保护提供了实用高效的解决方案。
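
The KL-divergence regularization mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual objective, so everything here is an assumption: the direction KL(original || edited), the `weight` parameter, and the helper names are hypothetical, and plain Python lists stand in for next-token logits on a benign query.

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def regularized_loss(edit_loss, orig_logits, edited_logits, weight=1.0):
    """Hypothetical objective: edit loss plus a KL penalty on a benign query.

    The penalty is zero when the edited model reproduces the original
    model's distribution and grows as the two distributions drift apart.
    """
    p = softmax(orig_logits)    # original model on a benign query
    q = softmax(edited_logits)  # edited model on the same query
    return edit_loss + weight * kl_divergence(p, q)
```

In this sketch an edit that leaves benign behavior untouched pays no penalty, which is the intuition behind using such a term to avoid over-triggering safe responses.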

GSQ-Tuning: Group-Shared Exponents Integer in Fully Quantized Training for LLMs On-Device Fine-tuning

Abstract

arXiv:2502.12913v3 Announce Type: replace-cross Abstract: Large Language Model (LLM) fine-tuning technologies have achieved remarkable results. However, traditional LLM fine-tuning approaches face significant challenges: they require extensive floating-point (FP) computation, raise privacy concerns when handling sensitive data, and are impractical for resource-constrained edge devices. While Parameter-Efficient Fine-Tuning (PEFT) techniques reduce trainable parameters, their reliance on floating-point arithmetic creates fundamental incompatibilities with edge hardware. In this work, we introduce GSQ-Tuning, a novel framework for on-device LLM fine-tuning that eliminates the need for floating-point operations in both inference and training. At its core is the Group-Shared Exponents Integer format, which efficiently represents model parameters in integer format using shared exponents among parameter groups. When combined with LoRA-like adapters, this enables fully integer-based fine-tuning that is both memory and compute efficient. We demonstrate that our approach achieves accuracy comparable to BF16-based fine-tuning while reducing memory usage by 1.85x. Moreover, compared to FP8, our method can reduce power consumption by 5x and chip area by 11x at the same performance, making large-scale model adaptation feasible on edge devices.

摘要

大型语言模型（LLM）微调技术已取得显著成果，但传统微调方法面临重大挑战：需要大量浮点（FP）计算，处理敏感数据时存在隐私风险，且难以在资源受限的边缘设备上实施。虽然参数高效微调（PEFT）技术减少了可训练参数，但其依赖浮点运算与边缘硬件存在根本性不兼容。本研究提出一种名为GSQ-Tuning的新型设备端LLM微调框架，该框架在推理和训练中均无需浮点运算。其核心是组共享指数整数格式，通过参数组间共享指数，以整数形式高效表示模型参数。当与类LoRA适配器结合时，可实现完全基于整数的高效内存与计算微调。实验表明，该方法在保持与BF16微调相当精度的同时，显著降低1.85倍内存占用。此外，在同等性能下，相较FP8方案可减少5倍功耗和11倍芯片面积，使边缘设备上的大规模模型适配成为可能。
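
To make the idea of integers with a shared exponent per parameter group concrete, here is a minimal sketch. It is not the paper's format: the group size, the 8-bit width, the power-of-two exponent rule, and the function names are all assumptions chosen for illustration.

```python
import math

def gse_quantize(values, group_size=4, bits=8):
    """Quantize floats to signed integers, one shared power-of-two exponent per group."""
    qmax = 2 ** (bits - 1) - 1  # largest representable magnitude, e.g. 127
    groups = []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        max_abs = max(abs(v) for v in group)
        if max_abs == 0.0:
            exponent = 0
        else:
            # Smallest exponent e with max_abs / 2**e <= qmax
            exponent = math.ceil(math.log2(max_abs / qmax))
        scale = 2.0 ** exponent
        # Round to the nearest integer and clamp as a safety net
        mantissas = [max(-qmax, min(qmax, round(v / scale))) for v in group]
        groups.append((exponent, mantissas))
    return groups

def gse_dequantize(groups):
    """Recover approximate floats from (exponent, integer mantissas) pairs."""
    return [m * 2.0 ** exponent for exponent, mantissas in groups for m in mantissas]
```

Storing one exponent per group instead of one per value is what saves memory in this kind of scheme; the rounding error of each value is at most half of its group's scale.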

ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation

Abstract

arXiv:2502.15543v2 Announce Type: replace-cross Abstract: Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have improved factuality by grounding outputs in external evidence. However, they remain susceptible to unfaithful generation, where outputs contradict retrieved context despite its relevance and accuracy. Existing approaches aiming to improve faithfulness primarily focus on enhancing the utilization of external context, but often overlook the persistent influence of internal parametric knowledge during generation. In this work, we investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. Building on this insight, we propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs and calibrating the model toward retrieved knowledge. To evaluate our approach, we introduce CoFaithfulQA, a benchmark specifically designed to evaluate faithfulness in scenarios where internal knowledge conflicts with accurate external evidence. Experimental results show that ParamMute significantly enhances faithfulness across both CoFaithfulQA and the established ConFiQA benchmark, achieving substantial reductions in reliance on parametric memory. These findings underscore the importance of mitigating internal knowledge dominance and provide a new direction for improving LLM trustworthiness in RAG. All code will be released via GitHub.

摘要

大型语言模型（LLMs）与检索增强生成（RAG）的结合通过将输出建立在外部证据基础上，提高了事实准确性。然而，这些模型仍存在不忠实生成的问题，即输出与检索到的上下文相矛盾，尽管上下文具有相关性和准确性。现有方法主要致力于增强对外部上下文的使用，却往往忽视了生成过程中内部参数化知识的持续影响。本研究探讨了不忠实生成的内部机制，并发现其中一组中深层前馈网络（FFNs）在此类情况下被过度激活。基于这一发现，我们提出了通过FFN抑制实现参数化知识静默（ParamMute）的框架，该框架通过抑制与不忠实性相关的FFNs激活，并将模型校准至检索知识，从而提升上下文忠实度。为评估该方法，我们引入了CoFaithfulQA基准，专门用于评估内部知识与准确外部证据冲突场景下的忠实性。实验结果表明，ParamMute在CoFaithfulQA和既有ConFiQA基准上均显著提升了忠实性，大幅降低了对参数化记忆的依赖。这些发现凸显了削弱内部知识主导地位的重要性，并为提升RAG中LLM的可信度提供了新方向。所有代码将通过GitHub发布。

ExpandR: Teaching Dense Retrievers Beyond Queries with LLM Guidance

Abstract

arXiv:2502.17057v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have demonstrated significant potential in enhancing dense retrieval through query augmentation. However, most existing methods treat the LLM and the retriever as separate modules, overlooking the alignment between generation and ranking objectives. In this work, we propose ExpandR, a unified LLM-augmented dense retrieval framework that jointly optimizes both the LLM and the retriever. ExpandR employs the LLM to generate semantically rich query expansions, which are leveraged to enhance the retriever's training. Simultaneously, the LLM is trained using Direct Preference Optimization (DPO), guided by a carefully designed reward function that balances retrieval effectiveness and generation consistency. This joint optimization paradigm enables mutual adaptation between the LLM and the retriever, resulting in query expansions that are both informative and well-suited for retrieval. Experimental results on multiple benchmarks show that ExpandR consistently outperforms strong baselines, achieving more than a 5% improvement in retrieval performance. All code is available at https://github.com/NEUIR/ExpandR.

摘要

大型语言模型（LLMs）通过查询增强在提升密集检索性能方面展现出显著潜力。然而，现有方法大多将LLM与检索器视为独立模块，忽略了生成目标与排序目标之间的对齐。本研究提出ExpandR——一个联合优化LLM与检索器的统一框架。该框架利用LLM生成语义丰富的查询扩展项以增强检索器训练，同时采用基于直接偏好优化（DPO）的LLM训练策略，通过精心设计的奖励函数平衡检索效果与生成一致性。这种联合优化范式实现了LLM与检索器的相互适配，使生成的查询扩展项既富含信息又契合检索需求。多基准测试表明，ExpandR始终优于强基线模型，检索性能提升超5%。全部代码已开源：https://github.com/NEUIR/ExpandR。
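
For readers unfamiliar with Direct Preference Optimization, the standard per-pair DPO loss can be sketched as below. This shows only the generic loss; how ExpandR scores query expansions to decide which one is "chosen" and which is "rejected" is not specified in the abstract, and the function name and `beta` default are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a full response under
    the trainable policy or the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) == log(1 + exp(-margin)), computed stably
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss equals log 2 when the policy and reference agree, and falls as the policy raises the chosen response's likelihood relative to the rejected one.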

Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference

Abstract

arXiv:2503.04779v4 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs' capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs' reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%.

摘要

大型语言模型（LLMs）在自动化编程任务中的应用日益广泛。然而，LLMs在程序语义推理方面的能力仍未得到充分研究，存在巨大的探索空间。本文提出FormalBench——一个旨在评估LLMs程序语义推理能力的综合基准，重点通过合成形式化程序规约以辅助验证程序正确性的任务进行考察。该任务不仅需要对所有可能的程序执行路径进行全面推理，还要求生成符合形式语法和语义的精确、语法正确的表达式。基于此基准，我们评估了LLMs在合成一致且完整规约方面的能力。研究发现：LLMs能较好处理简单控制流，但在处理复杂结构（尤其是循环）时表现欠佳，即使采用高级提示技术亦如此；此外，LLMs对语义保持变换的鲁棒性有限。我们还总结了常见失败模式，并设计了自修复提示策略，将成功率提升了25%。
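
To show what a specification for a loop involves, the sketch below annotates a small function with a precondition, a loop invariant, and a postcondition. The abstract does not name the specification language used by FormalBench, so Python runtime assertions are used here purely as a stand-in, and `sum_to` is a hypothetical example rather than a benchmark item.

```python
def sum_to(n: int) -> int:
    """Return 1 + 2 + ... + n, with its specification written as assertions."""
    # Precondition: n is non-negative.
    assert n >= 0
    total, i = 0, 0
    while i < n:
        # Loop invariant: total is the sum of 1..i, and i stays within [0, n].
        assert 0 <= i <= n and total == i * (i + 1) // 2
        i += 1
        total += i
    # Postcondition: the closed-form result holds on exit.
    assert total == n * (n + 1) // 2
    return total
```

The invariant is the hard part: it must hold before the loop, be preserved by every iteration, and imply the postcondition on exit, which requires reasoning over all executions rather than one trace.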

Toward universal steering and monitoring of AI models

Abstract

arXiv:2502.03708v2 Announce Type: replace-cross Abstract: Modern AI models contain much of human knowledge, yet understanding of their internal representation of this knowledge remains elusive. Characterizing the structure and properties of this representation will lead to improvements in model capabilities and development of effective safeguards. Building on recent advances in feature learning, we develop an effective, scalable approach for extracting linear representations of general concepts in large-scale AI models (language models, vision-language models, and reasoning models). We show how these representations enable model steering, through which we expose vulnerabilities, mitigate misaligned behaviors, and improve model capabilities. Additionally, we demonstrate that concept representations are remarkably transferable across human languages and combinable to enable multi-concept steering. Through quantitative analysis across hundreds of concepts, we find that newer, larger models are more steerable and steering can improve model capabilities beyond standard prompting. We show how concept representations are effective for monitoring misaligned content (hallucinations, toxic content). We demonstrate that predictive models built using concept representations are more accurate for monitoring misaligned content than using models that judge outputs directly. Together, our results illustrate the power of using internal representations to map the knowledge in AI models, advance AI safety, and improve model capabilities.

摘要

现代人工智能模型蕴含了大量人类知识，但其内部知识表征机制仍未被充分理解。揭示这种表征的结构与特性将有助于提升模型能力并发展有效保障措施。基于特征学习领域的最新进展，我们开发了一种可扩展的高效方法，用于从大规模AI模型（包括语言模型、视觉语言模型和推理模型）中提取通用概念的线性表征。研究表明这些表征能实现模型调控，通过该机制我们揭示了模型漏洞、纠正了行为偏差并提升了模型性能。此外，实验证明概念表征具有显著的语言迁移性，并能通过组合实现多概念协同调控。通过对数百个概念的定量分析，我们发现更新、更大规模的模型具有更强的可调控性，且调控效果能超越标准提示方法。研究还展示了概念表征在内容合规监测（包括幻觉内容和有害内容）中的有效性。实证表明，基于概念表征构建的预测模型比直接评判输出的模型具有更高的监测准确率。这些研究成果共同证明了利用内部表征来测绘AI模型知识图谱、推进AI安全治理以及增强模型能力的巨大潜力。

Firm or Fickle? Evaluating Large Language Models Consistency in Sequential Interactions

Abstract

arXiv:2503.22353v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have shown remarkable capabilities across various tasks, but their deployment in high-stake domains requires consistent performance across multiple interaction rounds. This paper introduces a comprehensive framework for evaluating and improving LLM response consistency, making three key contributions. First, we propose a novel Position-Weighted Consistency (PWC) score that captures both the importance of early-stage stability and recovery patterns in multi-turn interactions. Second, we present a carefully curated benchmark dataset spanning diverse domains and difficulty levels, specifically designed to evaluate LLM consistency under various challenging follow-up scenarios. Third, we introduce Confidence-Aware Response Generation (CARG), a framework that significantly improves response stability by incorporating model confidence signals into the generation process. Empirical results demonstrate that CARG significantly improves response stability without sacrificing accuracy, underscoring its potential for reliable LLM deployment in critical applications.

摘要

大型语言模型（LLMs）在各种任务中展现出卓越能力，但其在高风险领域的部署需要确保多轮交互中的性能一致性。本文提出一个评估与提升LLM响应一致性的综合框架，包含三项关键贡献：首先，我们提出新颖的"位置加权一致性"（PWC）评分指标，该指标能同时捕捉多轮交互中早期稳定性和后续恢复模式的重要性；其次，我们构建了一个精心设计的跨领域、多难度级别的基准数据集，专门用于评估LLMs在各类挑战性后续场景下的表现一致性；最后，我们开发了"置信度感知响应生成"（CARG）框架，通过将模型置信度信号融入生成过程，显著提升了响应稳定性。实验结果表明，CARG在保持准确性的同时显著改善了响应稳定性，这为LLMs在关键应用中的可靠部署提供了重要支持。

Temporal Relation Extraction in Clinical Texts: A Span-based Graph Transformer Approach

Abstract

arXiv:2503.18085v2 Announce Type: replace-cross Abstract: Temporal information extraction from unstructured text is essential for contextualizing events and deriving actionable insights, particularly in the medical domain. We address the task of extracting clinical events and their temporal relations using the well-studied I2B2 2012 Temporal Relations Challenge corpus. This task is inherently challenging due to complex clinical language, long documents, and sparse annotations. We introduce GRAPHTREX, a novel method integrating span-based entity-relation extraction, clinical large pre-trained language models (LPLMs), and Heterogeneous Graph Transformers (HGT) to capture local and global dependencies. Our HGT component facilitates information propagation across the document through innovative global landmarks that bridge distant entities. Our method improves on the state of the art by 5.5% in the tempeval $F_1$ score over the previous best and by up to 8.9% on long-range relations, which present a formidable challenge. We further demonstrate generalizability by establishing a strong baseline on the E3C corpus. This work not only advances temporal information extraction but also lays the groundwork for improved diagnostic and prognostic models through enhanced temporal reasoning.

摘要

从非结构化文本中提取时间信息对于事件情境化和获取可操作见解至关重要，尤其在医疗领域。我们基于经过充分研究的I2B2 2012时间关系挑战语料库，研究临床事件及其时间关系的提取任务。该任务因复杂的临床语言、长文档和稀疏标注而具有固有挑战性。我们提出GRAPHTREX方法，整合基于跨度的实体关系提取、临床大型预训练语言模型（LPLMs）和异构图变换器（HGT），以捕捉局部和全局依赖关系。我们的HGT组件通过创新的全局地标节点促进信息在文档中的传播，从而连接远距离实体。该方法将tempeval $F_1$ 分数较现有最佳水平提升5.5%，在具有挑战性的长距离关系任务上最高提升8.9%，实现了技术突破。我们通过在E3C语料库建立强基线进一步验证了方法的泛化能力。该研究不仅推动了时间信息提取技术的发展，还通过增强时间推理能力为改进诊断和预后模型奠定了基础。

MiZero: The Shadowy Defender Against Text Style Infringements

Abstract

arXiv:2504.00035v2 Announce Type: replace-cross Abstract: In-Context Learning (ICL) and efficient fine-tuning methods have significantly enhanced the efficiency of applying Large Language Models (LLMs) to downstream tasks. However, they also raise concerns about the imitation and infringement of personal creative data. Current methods for data copyright protection primarily focus on content security but lack effectiveness in protecting the copyright of text styles. In this paper, we introduce a novel implicit zero-watermarking scheme, namely MiZero. This scheme establishes a precise watermark domain to protect the copyrighted style, surpassing traditional watermarking methods that distort the style characteristics. Specifically, we employ LLMs to extract condensed-lists using the designed instance delimitation mechanism. These lists guide MiZero in generating the watermark. Extensive experiments demonstrate that MiZero effectively verifies text style copyright ownership against AI imitation.

摘要

上下文学习（ICL）与高效微调方法显著提升了将大语言模型（LLMs）应用于下游任务的效率，但也引发了对个人创意数据被模仿与侵权的担忧。现有数据版权保护方法主要关注内容安全，但在保护文本风格版权方面缺乏有效性。本文提出一种新型隐式零水印方案MiZero，该方案通过建立精确的水印域来保护受版权保护的风格，超越了传统会扭曲风格特征的水印方法。具体而言，我们利用设计的实例界定机制，通过LLMs提取浓缩列表，这些列表指导MiZero生成水印。大量实验表明，MiZero能有效验证针对AI模仿的文本风格版权归属。

Enhancing Retrieval for ESGLLM via ESG-CID -- A Disclosure Content Index Finetuning Dataset for Mapping GRI and ESRS

Abstract

arXiv:2503.10674v2 Announce Type: replace-cross Abstract: Climate change has intensified the need for transparency and accountability in organizational practices, making Environmental, Social, and Governance (ESG) reporting increasingly crucial. Frameworks like the Global Reporting Initiative (GRI) and the new European Sustainability Reporting Standards (ESRS) aim to standardize ESG reporting, yet generating comprehensive reports remains challenging due to the considerable length of ESG documents and variability in company reporting styles. To facilitate ESG report automation, Retrieval-Augmented Generation (RAG) systems can be employed, but their development is hindered by a lack of labeled data suitable for training retrieval models. In this paper, we leverage an underutilized source of weak supervision -- the disclosure content index found in past ESG reports -- to create a comprehensive dataset, ESG-CID, for both GRI and ESRS standards. By extracting mappings between specific disclosure requirements and corresponding report sections, and refining them using a Large Language Model as a judge, we generate a robust training and evaluation set. We benchmark popular embedding models on this dataset and show that fine-tuning BERT-based models can outperform commercial embeddings and leading public models, even under temporal data splits for cross-report style transfer from GRI to ESRS. Data: https://huggingface.co/datasets/airefinery/esg_cid_retrieval

摘要

气候变化加剧了对组织实践透明度和问责制的需求，使得环境、社会和治理（ESG）报告变得愈发重要。全球报告倡议组织（GRI）和欧洲新出台的可持续发展报告标准（ESRS）等框架旨在规范ESG报告，但由于ESG文件篇幅冗长且企业报告风格各异，生成全面报告仍具挑战性。为推进ESG报告自动化，可采用检索增强生成（RAG）系统，但其发展受限于缺乏适合训练检索模型的标注数据。本文利用一种未被充分挖掘的弱监督源——历史ESG报告中的披露内容索引，构建了涵盖GRI与ESRS标准的综合数据集ESG-CID。通过提取特定披露要求与对应报告章节的映射关系，并采用大语言模型作为评判器进行优化，我们生成了高质量的训练与评估集。基于该数据集对主流嵌入模型进行基准测试，结果表明：即使在GRI向ESRS跨报告风格迁移的时间序列数据分割条件下，基于BERT架构的微调模型性能仍优于商业嵌入方案和领先的公共模型。数据地址：https://huggingface.co/datasets/airefinery/esg_cid_retrieval

LEAVS: An LLM-based Labeler for Abdominal CT Supervision

Abstract

arXiv:2503.13330v2 Announce Type: replace-cross Abstract: Extracting structured labels from radiology reports has been employed to create vision models to simultaneously detect several types of abnormalities. However, existing works focus mainly on the chest region. Few works have investigated abdominal radiology reports, owing to the more complex anatomy and wider range of pathologies in the abdomen. We propose LEAVS (Large language model Extractor for Abdominal Vision Supervision). This labeler can annotate the certainty of presence and the urgency of seven types of abnormalities for nine abdominal organs on CT radiology reports. To ensure broad coverage, we chose abnormalities that encompass most of the finding types from CT reports. Our approach employs a specialized chain-of-thought prompting strategy for a locally-run LLM using sentence extraction and multiple-choice questions in a tree-based decision system. We demonstrate that the LLM can extract several abnormality types across abdominal organs with an average F1 score of 0.89, significantly outperforming competing labelers and humans. Additionally, we show that extraction of urgency labels achieved performance comparable to human annotations. Finally, we demonstrate that the abnormality labels contain valuable information for training a single vision model that classifies several organs as normal or abnormal. We release our code and structured annotations for a public CT dataset containing over 1,000 CT volumes.

摘要

从放射学报告中提取结构化标签已被用于创建能同时检测多种异常类型的视觉模型。然而，现有研究主要集中于胸部区域。由于腹部解剖结构更为复杂且病理类型更为广泛，针对腹部放射学报告的研究较少。我们提出LEAVS（面向腹部视觉监督的大语言模型提取器），该标注器能够对CT放射学报告中九种腹部器官的七类异常存在确定性及紧急程度进行标注。为确保广泛覆盖性，我们选择的异常类型涵盖了CT报告中大多数发现类型。该方法采用专门设计的思维链提示策略，通过基于树的决策系统结合句子提取与多选题，在本地运行的大语言模型上实现标注。实验表明，该大语言模型能够以平均F1分数0.89提取腹部器官的多种异常类型，显著优于现有标注工具及人工标注。此外，紧急程度标签的提取性能达到与人工标注相当的水平。最后，我们证明这些异常标签可为训练单一视觉模型提供有价值的信息，该模型能对多个器官进行正常或异常分类。我们公开了代码及包含1,000余例CT扫描的公共数据集的结构化标注。

Agentic Knowledgeable Self-awareness

Abstract

arXiv:2504.03553v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have achieved considerable performance across various agentic planning tasks. However, traditional agent planning approaches adopt a "flood irrigation" methodology that indiscriminately injects gold trajectories, external feedback, and domain knowledge into agent models. This practice overlooks the fundamental human cognitive principle of situational self-awareness during decision-making: the ability to dynamically assess situational demands and strategically employ resources. We propose agentic knowledgeable self-awareness to address this gap, a novel paradigm enabling LLM-based agents to autonomously regulate knowledge utilization. Specifically, we propose KnowSelf, a data-centric approach that equips agents with human-like knowledgeable self-awareness. Concretely, we devise a heuristic situation judgement criterion to mark special tokens on the agent's self-explored trajectories for collecting training data. Through a two-stage training process, the agent model can switch between different situations by generating specific special tokens, achieving optimal planning effects with minimal costs. Our experiments demonstrate that KnowSelf can outperform various strong baselines on different tasks and models with minimal use of external knowledge. Code is available at https://github.com/zjunlp/KnowSelf.

摘要

大型语言模型（LLMs）在各种代理规划任务中展现出卓越性能。然而，传统代理规划方法采用"漫灌式"策略，不加区分地向代理模型注入黄金轨迹、外部反馈和领域知识。这种做法忽视了人类决策过程中情境自我认知的基本认知原则——即动态评估情境需求并策略性调配资源的能力。为填补这一空白，我们提出具有知识性自我认知的代理新范式，使基于LLM的代理能自主调节知识运用。具体而言，我们提出KnowSelf这一以数据为中心的方法，赋予代理类人的知识性自我认知能力。通过设计启发式情境判断标准，我们在代理自主探索的轨迹上标注特殊标记以收集训练数据。经过两阶段训练过程，代理模型可通过生成特定特殊标记在不同情境间切换，以最小成本实现最优规划效果。实验表明，KnowSelf在多种任务和模型上仅需极少外部知识即可超越各强基线。代码详见https://github.com/zjunlp/KnowSelf。
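
The idea of marking special tokens on self-explored trajectories with a heuristic situation judgement can be illustrated with a small sketch. The criterion below (correct on the first try, correct after a retry, or still wrong), the token strings `<reflect>` and `<knowledge>`, and the `Step` record are all assumptions chosen for illustration; the abstract does not specify the actual criterion or tokens used by KnowSelf.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical special tokens; the real token names are not given in the abstract.
REFLECT = "<reflect>"
KNOWLEDGE = "<knowledge>"


@dataclass
class Step:
    first_try: str           # action the agent proposed on its own
    retry: Optional[str]     # action after one self-correction attempt, if any
    gold: str                # reference action from a gold trajectory


def mark_step(step: Step) -> str:
    """Assumed heuristic: pick the cheapest situation that reaches the gold action."""
    if step.first_try == step.gold:
        # The agent was right unaided: no special token.
        return step.first_try
    if step.retry == step.gold:
        # The agent fixed itself: mark the switch to reflection.
        return f"{step.first_try} {REFLECT} {step.retry}"
    # The agent could not recover: mark that external knowledge is needed.
    return f"{KNOWLEDGE} {step.gold}"


def mark_trajectory(steps: List[Step]) -> List[str]:
    """Turn a self-explored trajectory into token-marked training targets."""
    return [mark_step(s) for s in steps]


trajectory = [
    Step("open drawer", None, "open drawer"),
    Step("take cup", "take mug", "take mug"),
    Step("go to sink", "go to table", "go to stove"),
]
for target in mark_trajectory(trajectory):
    print(target)
```

Under this assumed scheme, a model trained on such targets would emit a special token only when the situation calls for it, which is one way to read the abstract's claim of minimal use of external knowledge.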

DynaCode: A Dynamic Complexity-Aware Code Benchmark for Evaluating Large Language Models in Code Generation

Abstract

arXiv:2503.10452v2 Announce Type: replace-cross Abstract: The rapid advancement of large language models (LLMs) has significantly improved their performance in code generation tasks. However, existing code benchmarks remain static, consisting of fixed datasets with predefined problems. This makes them vulnerable to memorization during training, where LLMs recall specific test cases instead of generalizing to new problems, leading to data contamination and unreliable evaluation results. To address these issues, we introduce DynaCode, a dynamic, complexity-aware benchmark that overcomes the limitations of static datasets. DynaCode evaluates LLMs systematically using a complexity-aware metric, incorporating both code complexity and call-graph structures. DynaCode achieves large-scale diversity, generating up to 189 million unique nested code problems across four distinct levels of code complexity, referred to as units, and 16 types of call graphs. Results on 12 latest LLMs show an average performance drop of 16.8% to 45.7% compared to MBPP+, a static code generation benchmark, with performance progressively decreasing as complexity increases. This demonstrates DynaCode's ability to effectively differentiate LLMs. Additionally, by leveraging call graphs, we gain insights into LLM behavior, particularly their preference for handling subfunction interactions within nested code. Our benchmark and evaluation code are available at https://github.com/HWH-2000/DynaCode.

摘要

大型语言模型（LLMs）的快速发展显著提升了其在代码生成任务中的表现。然而，现有的代码基准测试仍保持静态，由预定义问题的固定数据集构成。这使得它们在训练过程中容易受到记忆效应的影响——LLMs会回忆特定测试用例而非泛化至新问题，从而导致数据污染和不可靠的评估结果。为解决这些问题，我们提出DynaCode：一种动态的、复杂度感知的基准测试框架，可突破静态数据集的局限性。该框架采用复杂度感知度量方法系统评估LLMs，同时考量代码复杂度和调用图结构。DynaCode实现了大规模多样性，能在四个被称为"单元"的不同代码复杂度层级及16种调用图类型下，生成多达1.89亿个独特嵌套代码问题。在12个最新LLMs上的测试结果表明，相较于静态代码生成基准MBPP+，模型性能平均下降16.8%至45.7%，且性能随复杂度提升呈渐进式下降，这验证了DynaCode有效区分LLMs的能力。此外，通过调用图分析，我们深入理解了LLMs的行为特征，尤其是其处理嵌套代码中子函数交互的偏好。本基准测试框架及评估代码已发布于https://github.com/HWH-2000/DynaCode。
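
To make the notion of generating nested code problems from units and call graphs more concrete, here is a minimal sketch. The three units, the two call-graph shapes, and the rule that a function sums its callees' results before applying its own unit are all invented for illustration; they are not DynaCode's actual units, its 16 call-graph types, or its generation procedure.

```python
from itertools import product
from typing import Dict, Iterator, List, Tuple

# Hypothetical "units": expression bodies over a single variable x.
UNITS: Dict[str, str] = {"double": "x * 2", "inc": "x + 1", "square": "x * x"}

# Hypothetical call graphs (caller -> callees); f0 is always the entry point.
CALL_GRAPHS: Dict[str, Dict[str, List[str]]] = {
    "chain": {"f0": ["f1"], "f1": ["f2"], "f2": []},
    "fan_out": {"f0": ["f1", "f2"], "f1": [], "f2": []},
}


def build_problem(graph: Dict[str, List[str]], assignment: Dict[str, str]) -> str:
    """Emit Python source where each node sums its callees' results, then applies its unit."""
    blocks = []
    for node, callees in graph.items():
        # Leaf nodes have no callees and work directly on their argument.
        inner = " + ".join(f"{c}(x)" for c in callees) or "x"
        body = UNITS[assignment[node]]
        blocks.append(f"def {node}(x):\n    x = {inner}\n    return {body}\n")
    return "\n".join(blocks)


def run_problem(source: str, arg: int) -> int:
    """Execute the generated source and call its entry point to get a reference answer."""
    namespace: dict = {}
    exec(source, namespace)
    return namespace["f0"](arg)


def enumerate_problems() -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield every (graph name, unit assignment) pair in this toy space."""
    for name, graph in CALL_GRAPHS.items():
        nodes = list(graph)
        for combo in product(UNITS, repeat=len(nodes)):
            yield name, dict(zip(nodes, combo))


source = build_problem(CALL_GRAPHS["chain"], {"f0": "double", "f1": "inc", "f2": "square"})
print(source)
print("reference output for x=3:", run_problem(source, 3))
print("problems in this toy space:", sum(1 for _ in enumerate_problems()))
```

Even this toy space grows multiplicatively with the number of units, nodes, and graph shapes, which gives some intuition for how a dynamic benchmark can reach a very large number of distinct problems that are unlikely to appear verbatim in training data.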

